Variable length integers, or varints, are a way of compressing integers into less space than their fixed width representation normally takes. By default, computers use fixed length integers for reasons of hardware efficiency. However, when transmitting or storing integers, it is often worth compressing them to save bandwidth and space.
Varints are based on the observation that integer values are not uniformly distributed: in computing, small numbers are almost always more common than large ones. The trade-off that varints make is to spend more bits on larger numbers and fewer bits on smaller ones. For example, a 64 bit integer that is almost always less than 256 would waste the top 56 bits of a fixed width representation. The two common ways of encoding varints are length prefixes and continuation bits.
Google’s Protobuf uses the latter technique: the top bit of each byte indicates whether or not there are more bytes coming, and the lower 7 bits of each byte encode the actual number. For example, to encode the number 27:
Decimal: 27 = 16 + 8 + 2 + 1
Binary:  0001 1011
Notice how the top (left most) bit is 0. This means that there are no more bytes coming. With this technique, the numbers 0 - 127 can be encoded in a single byte. What happens, however, when a number is greater than 127? We have to split it into two bytes, with the first byte indicating that there is a second. Let’s encode the number 227:
Decimal: 227 = 128 + 64 + 32 + 2 + 1
Binary:  1110 0011
We can then split up the number into two 7 bit chunks:
000 0001 and 110 0011
For Protobuf, the least significant group comes first, which means that we should add a continuation bit to the low order group. Finally, we encode the number by reversing the order of the groups:
0 000 0001 and 1 110 0011
1110 0011 and 0000 0001
With only 2 bytes, we can encode the number 227. Notice how the top most bit of each byte tells us whether more bytes are coming: if the top bit is a 1, we keep reading; if it’s a 0, we know it’s the last byte. Decoding happens in reverse: remove the top most bit of each byte, reverse the order of the groups, and concatenate the bits to get the original number back.
This technique is really powerful, since it can encode a number of any size! With a 32 or 64 bit integer, you are limited to a fixed maximum value; with varints, you can mix big and small numbers freely, paying only for the bits each value actually needs. Even better, varints can be concatenated! Because it is always clear where one number ends and another begins, it’s possible to write them out back to back. Varints are self delimiting. Other nice properties:
- Byte aligned. No need to pad encoded numbers to a byte boundary.
- High efficiency. A 64 bit number can be encoded in at most 10 bytes.
UTF-8 encoding of characters makes use of the former technique, prefixing each encoded character with its length. A length prefix is normally considered inefficient, but UTF-8 encodes it in unary: the number of leading 1 bits indicates how many extra bytes are coming:
0xxx xxxx - 1 byte encoding
110x xxxx - 2 byte encoding
1110 xxxx - 3 byte encoding
1111 0xxx - 4 byte encoding
The prefix 10xx is not used for a leading byte in UTF-8; it is reserved for continuation bytes. This allows decoders to be tolerant of data corruption: if any bytes are damaged, the decoder can skip over them and find the next byte with a valid prefix. For example, encoding character number 27 in UTF-8 looks like:
Binary: 0001 1011 (bits gfe dcba)
UTF-8:  0gfe dcba = 0001 1011
The character number 227 would look like:
Binary: 1110 0011 (bits hgfe dcba)
UTF-8:  110x xxhg 10fe dcba = 1100 0011 1010 0011
Notice how this encoding puts the most significant group first, which differs from the Protobuf approach. There are different merits to either approach, particularly around encoding and decoding speed.
- Compatible with existing ASCII characters. UTF-8 can serve as a general purpose varint, but was designed to be backwards compatible with existing ASCII text.
- Fast to decode. A decoder knows exactly how many bytes to read after the first one, rather than checking each one individually.
- Error tolerant. Bytes can be deleted or mangled and the remaining characters can still be read.
- Easy to sort. Because UTF-8 orders by most significant group, comparing encoded byte values is the same as comparing the actual values. (ignoring Unicode collation semantics)
Enough Background, Let’s Make a Varint.
There are a whole bunch of kinds of variable length integers that I didn’t cover, and I’ve barely covered the strengths and weaknesses of the two most popular ones. The important takeaway is that we can choose what features we want and design an encoding appropriately. It’s like being at a data-structure buffet!
With that in mind, let’s define the problem and what we want solved; the design will fall out of that. Recently, I have been working on a picture upload site. Each picture is given a unique id and stored on disk somewhere. There could be lots of such pictures (like millions). The site could be running on either Windows or Linux. Finally, the pictures will be accessed over normal HTTP and possibly WebDAV.
With that in mind, let’s look at ways we can encode the picture’s unique id. Protobuf varints won’t work, because the encoded values might be non textual bytes. Putting random bytes in the URL bar, or in a file name, won’t be good. UTF-8 looks like it might fit the bill though: it is good at taking a number and turning it into valid text. The main problem is that UTF-8 is defined to store only valid Unicode code points, of which there are 1,112,064. (UTF-8 was intended for Unicode, which has a limited number of code points.) That means that if there are more than about a million pictures, most UTF-8 decoders won’t accept the bytes we encode.
Another problem is storing these pictures on disk. File systems behave poorly when there are a lot of files in a single directory. We need some sort of hierarchical layout, preferably one that can be derived from the unique id. A pattern that can be used to solve this is to take the leading bytes of the file name and make a directory structure from them. Git does something like this, using the SHA-1 hashes of the files (which are base 16 encoded). For example:
objects/
  d/
    e/
      9/
        de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3
This is a good idea, but raises another problem. How many levels should there be? In this example there are 3, which means there will be at most 4096 (16*16*16) directories. With millions of files, there are still going to be hundreds of files per directory. Git plays other tricks to solve this, but that adds a lot more complexity than we need.
A second problem with the Git approach is that there are going to be a lot of directories. SHA-1 hashes are pretty uniformly distributed, meaning that each new file added is likely to get its own directory while there are still few files. In the case where we don’t have millions of files, this is too much directory overhead. We need both the number of directories created and the number of files per directory to be tightly bounded.
One popular idea for encoding integers is to use a limited character set. For example, base64 encoding, while typically used for binary data, works well for encoding varints: it gets 6 bits of data into each 8 bit byte. That’s almost as dense as Protobuf varints. Compare that to the ordinary decimal encoding of 12345, which is just “12345”, with only 10 possible values per byte rather than 64. It’s also better than hex encoding, which offers only 16 values per byte.
There are two major problems with base64. First, it uses both uppercase and lowercase letters to encode values. Since these values are likely going to be file names too, and Windows doesn’t distinguish uppercase from lowercase in file names, it will be really problematic to run there. Second, base64 numbers are aesthetically unpleasant. For example, “3p8sf9JeGzr60+haC9F9mxANtLM=” just isn’t pleasant to look at. Sorting such names is also a minor problem, since “Z” is less than “a” in ASCII.
Time for a Solution
If base16 encoding doesn’t have enough characters and base64 has too many, let’s pick base32. Douglas Crockford has a really well thought out base32 character set. Notable features:
- No similar looking characters. There is 0 but not O; there is 1 but not l (ell) nor I (eye).
- No “u”, (and no “o” or “i”) so accidental profanity cannot happen.
- Case folds properly. We can stick with just lower case for encoding, but decode both upper and lower unambiguously.
- No special characters. Base64 variants include “=”, “-”, “+”, or “/”, which makes encoded values hard to pronounce.
Crockford really gets the human aspect of computing here. This is a great starting point, so let’s base our coding on base32, with the character set:

0123456789abcdefghjkmnpqrstvwxyz
Notice how numbers come first, unlike in base64 alphabets. This means that encoded bytes will sort properly without having to decode, since digits are before letters in ASCII.
Another convenience of this encoding is that there are at most 32 possible values per character. This makes it possible to turn each character into a directory when storing files, so there will be at most 32 directories or files per parent. The question of how many directory levels to have is solved: the number is capped at something reasonable and is implicit in the file name.
Now that we have an Alphabet…
We need to pick a way to encode our numbers into this alphabet. Desirable traits:
- Numerically close values should have close encodings.
- Comparing encoded values alphabetically should be the same as sorting decoded values numerically.
- Must be able to store at least 64 bit numbers.
- Must not be ambiguous.
First, numerically close values should have close encodings. Notice how 12345 and 12346 share a common prefix? That is a nice property to have when a human is looking at two values. Since pictures will typically get auto incrementing ids, it would be nice if their file names resulted in placement in the same directory.
Second, having encoded values compare the same as decoded values means that you don’t have to decode values (which is expensive) in order to do some operations. For example, it is much easier to tell whether an encoded number will overflow upon decoding by comparing it against the encoding of the maximum value than by decoding it and checking. If we wanted to check whether an encoded number was bigger than 8, we could compare “9” > “8” without ever decoding it.
Third, it must be able to store the maximum word size of our CPU. Varints can store arbitrarily large values as mentioned above, but for our purpose we can limit ourselves to just 64 bit values if it makes our life easier elsewhere.
Fourth, the encoding must not be ambiguous. UTF-8 handles this by forbidding leading zeros. Does the number 000000001 = 1? Numerically yes, but as encodings no, since “000000001” != “1”. UTF-8 rejects such overlong values as coding errors. Protobuf does not tackle this issue at all, and allows them.
An unfortunate side effect of ambiguous codings is inefficiency: the mapping from numbers to encodings is no longer one-to-one and onto, which means some encodings are wasted.
The solution here is to simply not have the problem to begin with! Our encoding can just use these otherwise ambiguous encodings as unique values, so there is no problem. Encodings will be dense, and we don’t have to do error checking.
Now that we have all our requirements set out, let’s look at a solution that works nicely.
The encoding will use the alphabet of ASCII characters:

0123456789abcdefghjkmnpqrstvwxyz

with each character representing the numbers 0 - 31, respectively. The varint will be split into a prefix and a suffix.
The encoding will be length prefixed, starting from the halfway point of “g”. A prefix byte of “g” through “z” (16 characters, since i, l, o, and u are absent) will mean 1 - 16 subsequent bytes, respectively. A prefix byte of “0” - “f” will mean that literal value, the same as normal hex encoding. For example:
"0" = 0 "1" = 1 ... "a" = 10 ... "f" = "15"
If the prefix byte is in “g” - “z”:
"g_" (1 byte suffix) "h__" "j___" ... "z_______________" (16 byte suffix)
If a suffix is present, it is the most significant group first base32 encoding of the number, offset by the count of all values with shorter encodings. For example, counting up:
"a" - 10 "b" - 11 "c" - 12 "d" - 13 "e" - 14 "f" - 15 "g0" - 16 "g1" - 17 ... "gz" - 47 "h00" - 48
This is the important point about ambiguity. Notice how “g0” means “there is one suffix byte, and the value is 0 plus the count of all shorter encodings”. “h00” works the same way: the extra 0’s aren’t wasted space or illegal values like they would be in Protobuf or UTF-8. The downside to this approach is that decoding is no longer just bit shifts. However, because of a convenient identity, the overhead turns out to be just the addition of a constant. That is about as expensive as checking for overlong encodings, so it turns out to not be that bad.
The prefix is ordered. Any number that starts with “h” will always be strictly larger than any number that starts with “0” through “g”. This is exactly what we want, since it means we can compare encoded numbers. This is something special; it isn’t merely a side effect of using a most significant first encoding. Consider regular decimal numbers, which are written most significant digit first: 8, 9, 10, 11, … Sorted as text, these come out “10”, “11”, “8”, “9”. Special measures would have to be taken to sort them properly without the length prefix.
The length prefix also gives us the proper file placement we want. For example, the files “0.jpg”, “h00.jpg”, and “h01.jpg” would be stored as:
objects/
  0.jpg
  h/
    0/
      h00.jpg
      h01.jpg
Files will be stored close to each other, making it easy to figure out where to look, and directories will never get too large. This is really important when scanning a remote directory, such as over SFTP or WebDAV. The number of directories is strictly bounded by the number of files.
Assuming files are not deleted or changed, it is also very easy to compare a local and a remote directory: walking the last (highest sorted) directory in the remote file tree tells you how far ahead or behind the remote is.
Taking advantage of the fact that varints can be concatenated, we get another cool feature: thumbnails. Suppose that each picture has a thumbnail made for it. Each picture has a unique id, and each thumbnail an auto incrementing index. For example, for picture id 49 (encoded as “h01”) and thumbnail index 0 (encoded as “0”), we make the files “h01.jpg” and “h010.jpg”. The encoded name of the thumbnail is unambiguously parsed as the two integers 49 and 0. The thumbnail is placed in the same directory as the original, and in the proper sort order relative to other files.
How many values can we store with such a varint? The max is the count of all values shorter than “z” encodings, plus all “z” encodings. “z” encodings have 16 suffix bytes, and each byte carries 5 bits, which means we have about 80 bits to play with. This also conveniently covers the 80 bit “long double” numbers that x86 CPUs use.
To find the starting value for each prefix, we use the geometric series identity:

X^1 + X^2 + ... + X^n = (X^(n+1) - X) / (X - 1)
Since X is 32 and n is 16, and we offset the start by 16, the max value is 16 + (32^17 - 32) / 31 - 1 = 1247923426698972051309615.
Side note: 2^80 is 1208925819614629174706176, which is slightly less than the max value here. The reason is that this encoding uses the unambiguous, dense scheme. Dense encoding is an efficiency gain of 1/(x-1), where x is the base of your varint. In our case, the total number of values that can be encoded is essentially 2^80 * (1 + 1/31). For Protobuf, which uses base128 varints, a dense coding scheme would mean a gain of only 1/127, which is negligible. It is still desirable to use the dense scheme, since it avoids error checking.
Varints can encode 64 bit integers easily, but they can also encode numbers that are too large. Checking whether a number overflows while decoding it is difficult; it is much easier and less error prone to compare it to the known max encoding, “weyyyyyyyyyyyf”, using an existing string comparison (though care must be taken to lowercase the number first). Varints can represent negative numbers too, by casting them to unsigned numbers first.
Lastly, encoding and decoding these varints is fast. Because the prefix tells us the length ahead of time, decoding can be implemented in terms of table lookups. On my computer, converting a number to a varint and back again takes about 200ns. This is not the fastest, but it is good enough. For a deeper dive on varint decoding performance, see the excellent presentation by Jeff Dean on varints in Google’s search indexing.
I have a working, fast, tested, and fuzzed reference implementation of varints made in Go.
There are also a ton of other super cool encoding schemes with various trade offs. Knowing the distribution of your numbers and the properties you want from them are the keys to making your own varint scheme.