Carl Mastrangelo

A programming and hobby blog.


Why Does gRPC Insist on Trailers?

gRPC comes up occasionally on the Orange Site, often with a redress of grievances in the comment section. One of the major complaints people have with gRPC is that it requires HTTP trailers. This one misstep has caused so much heartache and trouble that I think it is probably the reason gRPC failed to achieve its goal. Since I was closely involved with the project, I wanted to rebut some misconceptions I see posted a lot, and warn future protocol designers against the mistakes we made.

Mini History of gRPC’s Origin

gRPC was reared by two parents trying to solve similar problems:

  1. The Stubby team. They had just begun the next iteration of their RPC system, used almost exclusively throughout Google. It handled 10^10 queries per second in 2015. Performance was a key concern.
  2. The API team. This team owned the common infrastructure serving (all) public APIs at Google. The primary value-add was converting REST+JSON calls to Stubby+Protobuf. Performance was a key concern.

The push to Cloud was coming on strong from the top, and the two teams joined forces to ease communication from the outside world to the inside. Rather than boil the ocean, they decided to reuse the newly minted HTTP/2 protocol. Additionally, they chose to keep Protobuf as the default wire format, but allow other encodings too. Stubby had tightly coupled the message, the protocol format, and custom extensions, making it impossible to open source just the protocol.

Thus, gRPC would allow intercommunication between browsers, phones, servers, and proxies, all using HTTP semantics, and without forcing the entirety of Google to change message formats. Since message translation is no longer needed, high-speed communication between endpoints is tractable.

HTTP, HTTP/1.1, and HTTP/2

HTTP is about semantics: headers, messages, and verbs.
HTTP/1.1 is a mix of a wire format plus the semantics (RFCs 7231-7239). gRPC tries to keep the HTTP semantics while upgrading the wire format. Around 2014-15, SPDY was being tested by Chrome and the GFE as a workaround for problems with HTTP/1.1, notably head-of-line blocking and the cost of repeatedly opening connections and re-sending headers.

Based on the promising improvements seen in the SPDY experimentation, the protocol was formalized into HTTP/2. HTTP/2 only changes the wire format, but keeps the HTTP semantics. This allows newer devices to downgrade the wire format when speaking with older devices.

As an aside, HTTP/2 is technically superior to WebSockets. HTTP/2 keeps the semantics of the web, while WS does not. Additionally, WebSockets suffers from the same head-of-line blocking problem HTTP/1.1 does.

Those Contemptible Trailers

Most people do not know this, but HTTP has had trailers in the specification since 1.1. The reason they are so uncommonly used is that most user agents don’t implement them, and don’t surface them to the JS layer.

Several events happened around the same time that led to the bet on requiring trailers.

The thinking went like this:

  1. Since we are using a new protocol, any devices that use it will need to upgrade their code.
  2. When they upgrade their code, they will need to implement trailer support anyways.
  3. Since HTTP/2 effectively mandates TLS (browsers only implement it over TLS), it is unlikely middleboxes will error on unexpected trailers.

Why Do We Need Trailers At All?

So far, we’ve only talked about whether it’s possible to use trailers, not whether we should use them. It had been over two decades without needing them, so why put such a big risk into the gRPC protocol?

The answer is that trailers solve an ambiguity. Consider the following HTTP conversation:

GET /data HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Server: example.com

abc123

In this flow, what was the length of the /data resource? Since we don’t have a Content-Length, we are not sure the entire response came back. If the connection was closed, does it mean it succeeded or failed? We aren’t sure.

Since streaming is a primary feature of gRPC, we often will not know the length of the response ahead of time. HTTP aficionados are probably feeling pretty smug right now: “Why don’t you use Transfer-Encoding: chunked?” This too is insufficient, because errors can happen late in the response cycle. Consider this exchange:

GET /data HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Server: example.com
Transfer-Encoding: chunked

6
abc123
0

Suppose that the server was in the middle of streaming a chat room message back to us, and there is a reverse proxy between our user agent and the server. The server sends chunks back to us, but after sending the first chunk of 6, it crashes. What should the proxy send back to us? It’s too late to change the response code from 200 to 503. If there were pipelined requests, all of them would need to be thrown away too. The proxy could terminate the connection to signal failure, but connections cost a lot to make, and it would rather not tear one down over an arguably recoverable error.

Hopefully this illustrates the ambiguity between successful, complete responses, and a mic-drop. What we need is a clear sign the response is done, or a clear sign there was an error.

Trailers are this final word, where the server can indicate success or failure in an unambiguous way.
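
As a concrete sketch, here is the chunked exchange again, this time ended by a trailer. (HTTP/1.1 trailer syntax is shown to match the earlier examples; gRPC itself runs over HTTP/2, but grpc-status is the actual key it uses.)

GET /data HTTP/1.1
Host: example.com
TE: trailers

HTTP/1.1 200 OK
Server: example.com
Transfer-Encoding: chunked
Trailer: grpc-status

6
abc123
0
grpc-status: 0

If the server crashes mid-stream, the trailer never arrives (or arrives with a non-zero status), so the client can distinguish truncation from a successful, complete response.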

Trailers for JSON vs. Protobuf

While gRPC is definitely not Protobuf specific, it was created by people who have been burned by Protobuf’s encoding. The encoding of Protobuf probably had a hand in the need for trailers, because it’s not obvious when a Proto is finished. Protobuf messages are a concatenation of Key-Length-Values. Because of this structure, it’s possible to concatenate 2 Protos together and have the result still be valid. The downside of this is that there is no obvious point at which the message is complete. An example of the problem:

syntax = "proto3";
message DeleteRequest {
   string id = 1;
   int32 limit = 2;
}

The wire format for an example message looks like:

Field 1: "zxy987"
Field 2: 1

A program can override a value by concatenating another field onto the end:

Field 2: 1000

The concatenation would be:

Field 1: "zxy987"
Field 2: 1
Field 2: 1000

Which would be interpreted as:

Field 1: "zxy987"
Field 2: 1000

This makes encoding messages faster, since there is no size field at the beginning of the message. However, there is now a (mis-)feature where Protos can be split or joined along KLV boundaries.
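
To see this merge-by-concatenation behavior in code, here is a minimal Python sketch. The wire bytes are hand-rolled from the message above; delete_pb2 is an assumed protoc-generated module for DeleteRequest.

import delete_pb2  # assumption: generated by protoc from the .proto above

# Field 1 (length-delimited): id = "zxy987"; Field 2 (varint): limit = 1.
first = bytes([0x0A, 0x06]) + b"zxy987" + bytes([0x10, 0x01])
# Field 2 again (varint): limit = 1000, encoded as 0xE8 0x07.
patch = bytes([0x10, 0xE8, 0x07])

msg = delete_pb2.DeleteRequest()
msg.ParseFromString(first + patch)  # the concatenation is still valid
assert msg.id == "zxy987" and msg.limit == 1000  # the last value wins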

JSON has the upper hand here. With JSON, the message has to end with a closing curly } brace. If we haven’t seen the final curly and the connection hangs up, we know something bad has happened. JSON is self-delimiting, while Protobuf is not. It’s not hard to imagine that trailers would be less of an issue if the default encoding was JSON.
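
A quick sketch of that detectability, using only the Python standard library:

import json

try:
    json.loads('{"id": "zxy987", "limit": 10')  # stream cut off early
except json.JSONDecodeError:
    print("truncated: the missing curly brace is detectable")

A Protobuf cut on a KLV boundary, by contrast, parses cleanly with the tail silently missing.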

The Final Nail in gRPC’s Trailers

Trailers were officially added to the fetch API, and all major browsers said they would support them. The authors were part of the WHATWG, and worked at the companies that could actually put them into practice. However, Google is not one single company, but a collection of independent and distrusting companies. While the point of this post is not to point fingers, a single engineer on the Chrome team decided that trailers should not be surfaced up to the JS layer. You can read the arguments against it, but the short version is that there was some fear around semantic differences causing security problems. For example, if a Cache-Control header appears in the trailers, does it override the one in the headers?

I personally found this reason weak, and offered a compromise of treating them as semantic-less key-values surfaced up to the fetch layer. Maybe I was wrong, or maybe I failed to make the argument, but I strongly suspect organizational boundaries had a substantial effect. The Area Tech Leads of Cloud also failed to convince their peers in Chrome, and as a result, trailers were ripped out.

Lessons for Designers

This post hopefully exposed why trailers were included, and why they ultimately didn’t work. I left the gRPC team in 2019, but I still think fondly of what we created. There are gobs of things the team got right; unfortunately, this one mistake ended up being its demise. Some takeaways:

  1. Requiring a feature that existing user agents rarely implement is a bet that the entire ecosystem will upgrade; hedge that bet.
  2. Design unambiguous success and failure signaling into the protocol, but put it somewhere every client can actually reach.
  3. Organizational boundaries can kill a feature even when the specification permits it.


A Better Base 58 Encoding

Base 58 is an encoding scheme with better usability than Base 64. Base 58 offers several ease-of-use improvements: the alphabet omits look-alike characters (0 and O, I and l), and it avoids punctuation, so encoded values survive double-click selection, URLs, and email without escaping.

Base 58 is most commonly seen in Bitcoin addresses, where it has grown in popularity. While Base 58 is slightly less information-dense than Base 64, Base 58 “fits” in more places, and is easier for humans to read. (Similar in nature to Base 32, which I describe in Let’s Make a Varint.)

Before we get into the problems of Base 58, I have included a JS implementation of the improved Base 58 encoder. Try it out!

(Interactive demo: Base 58 encoding, with its efficiency reported as input bits / output bits.)

Challenges

Base 58 brings some complications. Power-of-2 bases are very fast to output, because the input is already in a power-of-2 base: a sequence of 8-bit bytes. Encoding the text can be seen as a change of base from base 256 to the new base. For Base 64, this can be done quickly using bit shifting and masking. For Base 58, though, we need division. Division is much slower than bit operations, but is unavoidable. The Base 58 encoding scheme uses long division to achieve the change of base.

This means that converting an n byte sequence to the Base 58 format is a quadratic runtime operation! That makes it useful only for very small pieces of text (e.g. Bitcoin wallet addresses) and impractical for use in most other places.
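
For reference, a minimal Python sketch of the classic quadratic encoder, assuming the common Bitcoin alphabet. Each divmod walks the entire remaining number, which is where the quadratic runtime comes from:

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    # Treat the whole input as one big base-256 number.
    num = int.from_bytes(data, "big")
    out = []
    while num:
        num, rem = divmod(num, 58)  # long division over the full number
        out.append(ALPHABET[rem])
    # Leading zero bytes carry no numeric weight; encode each as '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * pad + "".join(reversed(out))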

Encoding Better - NTRU Prime

In an ideal world, we would be able to use the smaller alphabet size of Base 58, but avoid the costly quadratic conversion and complex code. Adam Langley describes in detail an algorithm called NTRU Prime encoding. The post describes an encoding that is able to change base without the slow long division. The idea is that instead of encoding the output as the minimal possible representation, small amounts of non-uniformity in the digits are okay. Adjusting his example from Base 10 to Base 58, this means that not every digit has a uniformly equal probability distribution (and, as a result, doesn’t carry log2(58) bits of entropy per character).

However, the algorithm can be tuned based on a “comfort” margin of non-uniformity. In the JS toy above, the “Base 58 Uniformity Comfort Limit” parameter changes how entropy-dense the encoding is. The higher the limit, the more likely the encoder will avoid outputting a digit with a non-uniform distribution. The lower the value, the faster the output.
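
This is not the NTRU Prime algorithm itself, but a toy Python sketch in the same spirit: convert small fixed-size chunks instead of the whole number, so every division is over a bounded value and the encode is linear, at the cost of some density and digit uniformity.

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_chunked(data: bytes) -> str:
    if len(data) % 2:
        data += b"\x00"  # toy framing: pad to whole 2-byte chunks
    out = []
    for i in range(0, len(data), 2):
        n = int.from_bytes(data[i:i + 2], "big")  # 0..65535
        for _ in range(3):  # 58**3 = 195112 >= 65536, so 3 digits always fit
            n, rem = divmod(n, 58)  # division over a small, bounded number
            out.append(ALPHABET[rem])
    return "".join(out)

Each chunk packs 16 bits into roughly 17.6 bits of Base 58 capacity; that wasted sliver is exactly the uniformity-for-speed trade the comfort limit tunes.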

Thus, the better Base 58 encoder uses the same Base 58 alphabet of characters, but a much faster algorithm.

There are 2 downsides to this scheme:

  1. The “comfort” limit needs to be known in advance by readers and writers.
  2. There are multiple encoded forms of the same input, so it’s not possible to compare values without decoding first.

Conclusion

Base 58 is a useful encoding scheme, but let’s use the fast encoder and decoder to process Base 58 text.

Notes:

The formal algorithm is described in the NTRU Prime submission to the Post-Quantum cryptography contest. Originally shared by djb.

I won’t even link to the draft Base 58 encoding RFC which tried to standardize it. It is woefully under-specified, and not ready to turn into workable code. I had to scour GitHub to find out how it was actually implemented after hours of failed attempts. I want to save you that wasted time.

In addition to the JS encoder included above, I have a working Python encoder and decoder.


The Impossible Java 11

Over the past two years, I did something no one thought was possible. I updated our code from JDK 8 to JDK 16. Out of all the things I’ve accomplished at Netflix, this is the one thing I’ve had the most questions about, and the most astonishment. “How did you do it?”

If you aren’t familiar with Java, Oracle made a tough decision to modernize Java. Prior to Java 9, a major version came out once every few years, with relatively minor breaking changes. All that changed in Java 9: major versions were going to be breaking, and coming out once every 6 months.

However, unlike Python 3000, Java’s breaking changes were extremely limited. Only JVM internals would be restricted or removed. That’s it. All the existing language features were still available for use. All the JVM bytecode and classes still worked. Want to use a class written in 1995? If it doesn’t depend on sun.* code, it’s good to go!

But, this is also why so many people thought upgrading was impossible. So much Java code out there had taken liberties with the JDK internals. Want to use sun.misc.Unsafe? It’s yours for the taking. Want to modify final (a.k.a. const) fields? It’s only a little reflection away. Want to inject some classes into the JDK classloader? No one will stop you.

With no teeth in the API boundaries, Java 8 pretty much let application and library owners do what they wanted. Guice, Guava, Jackson, Groovy, Gradle, CGLib, Mockito, Lombok, ErrorProne, and other common Java tech played fast and loose with the guts of the JVM. Yes, their code was faster. But no, it wasn’t a long term win.

This is where Oracle’s tough decision comes into play. Java could improve itself, add features, and improve the VM, but it would come at the cost of breaking the ecosystem. Many libraries, which had claimed wins, would have to give them back. Alternatives would be provided, but everyone would have to coordinate a worldwide, backwards compatible, multi-version upgrade. Every library not wanting to yield their gains would need to be rewritten. Most poignantly, applications would need to wait until every library they depended on made this move.

Two things stand out to me about this decision:

  1. Having strong, enforceable API boundaries is necessary for long-term improvements. Short-term wins, while alluring, usually come with non-standard or unsupported baggage. The stricter it is today, the more maintainable it is in the future.

  2. Taking away functionality, even under the auspices of standards or specifications, is going to hurt. If it used to work, and you break it, it’s your fault.

Java 11

When I joined Netflix, no one told me it was impossible to upgrade from Java 8 to 11. I just started using it. When things didn’t work (and they definitely didn’t!) on 11, I went and checked if I needed to update the library. I did this as a back-burner project, on my own machine, separate from the main repo. One by one, all the non-working libraries were updated to working ones. When a library was not Java 11 compatible, I filed a PR on GitHub to fix it. And, plain as it sounds, when there are no more broken things, only working things are left!

It was only after I pushed these changes to production, that I found out what I did was impossible. Numerous people asked me how I made this happen. Someone, or some group, had looked at the daunting task of updating hundreds of projects and probably hundreds more libraries, and had given up hope. Like a poison placebo, people’s perception of the impossibility actually made it impossible. Socialize these obstacles and you end up with a whole company finding it insurmountable.

When I told them how I fixed it, they were usually disappointed. “You just updated the libraries?” they would ask, looking for something deeper. Java 11 was time-consuming to upgrade to, but not very hard. The only difference between my approach and theirs was a lack of despair.

I bring up this story to boost the confidence in others that using the latest and greatest is within grasp. A month ago I updated our code to Java 15, and last week to 16. It gets easier each time. Once you are close to the latest version, it’s no challenge to stay there. Since the only breaking changes were hiding JVM internals, and we’re no longer using those, it’s trivial to update. As a reward, we get all the advanced features (better JIT, GC, language features, etc.) that have been delivered over the past years.

I encourage you to take a look at updating too, since it is probably easier than you think!




You can find me on Twitter @CarlMastrangelo