A Better Base 58 Encoding
Base 58 is an encoding scheme with a better usability than Base 64. Base 58 offers several ease of use improvements:
- No punctuation in the output.
- Double clicking the encoded text highlights all of it
- Avoids similar looking characters
Base 58 is most commonly seen in Bitcoin addresses, where it has grown in popularity. While Base 58 is slightly less information-dense than Base 64, Base 58 “fits” in more places, and is easier for humans to read. (Similar in nature to Base 32, which I describe in Let’s Make a Varint.)
Before we get into the problems of Base 58, I have included a JS implementation of the improved Base 58 encoder. Try it out!
- Base 58 Encoding:
- Efficiency (input bits / output bits )
Base 58 brings some complications. Power-of-2 bases are very fast to output, as the input text is also a power of two: e.g. a sequence of 8 bit bytes. Encoding the text can be seen as a change of base between base 256 to the new base. For Base 64, this can be quickly by using bit shifting and masking. For Base 58 though, we need to use division. Division is much slower than bit operations, but is unavoidable. The Base 58 encoding scheme uses long division to achieve the change of base.
This means that converting an
n byte sequence to the Base 58 format is a quadratic runtime operation! This makes it useful only for very small pieces of text (e.g. Bitcoin wallet addresses) and impractical for use in most other places. There are a number of problems with the implementation:
- Encoding is very slow. (O(n2))
- Encoding requires implementing a complex, change of base converter.
- Either the code must include an arbitrary precision integer library, which may not be available
- … or implement the complex long division code by hand. (example)
Encoding Better - NTRU Prime
In an ideal world, we would be able to use the smaller alphabet size of Base 58, but avoid the costly quadratic conversion and complex code. Adam Langley describes in detail an algorithm called NTRU Prime encoding. The post describes an encoding that is able change base, without the slow long division. The idea is that instead of encoding the output as the minimal possible representation, small amounts of non uniformity in the digits is okay. Adjusting his example from Base 10 to Base 58, this means that not every digit has a uniformly equal probability distribution. (and as a result, doesn’t have log258 bits of entropy per character.)
However, the algorithm can be tuned based on a “comfort” margin of non uniformity. In the JS toy above, the “Base 58 Uniformity Comfort Limit” parameter changes how entropy dense the encoding is. The higher the limit, the more likely the encoder will avoid outputting a digit with a non-uniform distribution. The lower the value, the faster it outputs.
Thus, the better Base 58 Encoder uses the the same Base 58 Alphabet of characters, but uses a much faster algorithm. Notably:
- Linear time (O(n)) conversion from original text to Base 58 text.
- Easy to read code, with many fewer edge cases
- Easy to parallelize code since the encoded text avoids dependencies on the adjacent characters
- Tunable efficiency (almost always above 99.8% of the max entropy coding.
There are 2 downsides to this scheme:
- The “comfort” limit needs to be known in advance by readers and writers.
- There are multiple encoded forms of the same input, so it’s not possible to compare values without decoding first.
Base 58 is a useful encoding scheme, but let’s use the fast encoder and decoder to process Base 58 text.
The formal algorithm is described in the NTRU Prime submission to the Post-Quantum cryptography contest. Originally shared by djb.
I won’t even link to the draft Base 58 encoding RFC which tried to standardize it. It is woefully under-specified, and not ready to turn into workable code. I had to scour GitHub to find out how it was actually implemented after hours of failed attempts. I want to save you that wasted time.
In addition to the JS encoder included above, I have a working Python encoder and decoder.