Carl Mastrangelo

A programming and hobby blog.


Accidental Class Loading in Java

Recently I’ve been working on improving the ergonomics of my tracing library PerfMark. One of the interesting things I noticed is that the JVM was loading some classes I didn’t expect it to. Let’s find out why!

“That’s Weird”

PerfMark works on much older JVMs, but strives to be as fast as possible. To accomplish this, the code bootstraps itself based on what advanced JVM features are available (MethodHandles, VarHandles, Hidden Classes, etc.). Because of this, the library needs to avoid loading classes that aren't usable. In the event that no advanced features are available, the library safely disables itself. The code below shows how to safely load optionally available classes:

// SecretPerfMarkImpl.PerfMarkImpl

static {
  Generator gen = null;
  Throwable problem = null;
  try {
  Class<?> clz =
    Class.forName(
        "io.perfmark.java7.SecretMethodHandleGenerator$MethodHandleGenerator");
  gen = clz.asSubclass(Generator.class).getConstructor().newInstance();
  } catch (Throwable t) {
    problem = t;
  }
  if (gen != null) {
    generator = gen;
  } else {
    generator = new NoopGenerator();
  }
  // More Initialization code.
}

As we can see, the static initializer tries to load a special class that requires Java 7 or higher. If it isn't available, we fall back to the no-op implementation. The actual types of the classes are not especially important, except that Generator is an abstract class. We can check what is actually loaded by passing the

-Xlog:class+load=info

flag to the JVM. This lets us see what classes are loaded and when:

[1.685s][info][class,load] io.perfmark.PerfMark 
[1.693s][info][class,load] io.perfmark.Impl 
[1.694s][info][class,load] io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl
[1.695s][info][class,load] java.util.concurrent.ConcurrentHashMap$ForwardingNode 
[1.697s][info][class,load] java.lang.AssertionError 
[1.701s][info][class,load] io.perfmark.impl.Generator 
[1.702s][info][class,load] io.perfmark.impl.NoopGenerator
[1.704s][info][class,load] io.perfmark.Tag
[1.706s][info][class,load] io.perfmark.Link
[1.707s][info][class,load] io.perfmark.impl.SecretPerfMarkImpl
[1.709s][info][class,load]
  io.perfmark.java7.SecretMethodHandleGenerator$MethodHandleGenerator
[1.712s][info][class,load] java.lang.Long$LongCache
[1.714s][info][class,load] java.lang.invoke.LambdaForm$MH/0x00000008000c5000
  source: __JVM_LookupDefineClass__
[1.718s][info][class,load] java.lang.invoke.LambdaForm$MH/0x00000008000c5400
  source: __JVM_LookupDefineClass__
[1.721s][info][class,load] java.lang.invoke.LambdaForm$BMH/0x00000008000c5800 
  source: __JVM_LookupDefineClass__

This is kind of strange. The MethodHandleGenerator class is definitely available, but it loads after the NoopGenerator. Worse, it seems like both classes end up being loaded. What’s going on?

“When Flags Aren’t Enough”

Let’s ratchet up the verbosity to see what the loader is doing:

-Xlog:class+resolve=debug

Running with this shows that the class loading is caused by the verification step the JVM performs when loading classes:

[1.826s][info ][class,load   ] io.perfmark.Impl 
[1.826s][debug][class,resolve] io.perfmark.Impl java.lang.Object (super)
[1.827s][debug][class,resolve]
  java.lang.Class io.perfmark.Impl Class.java:382 (explicit)
[1.828s][info ][class,load   ] io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl 
[1.828s][debug][class,resolve]
  io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl io.perfmark.Impl (super)
[1.828s][debug][class,resolve]
  java.lang.Class io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl Class.java:382 (explicit)
[1.828s][debug][class,resolve] io.perfmark.Impl java.lang.Throwable (verification)
[1.832s][debug][class,resolve] io.perfmark.Impl java.lang.AssertionError  (verification)
[1.833s][debug][class,resolve]
  io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl java.lang.Throwable
[1.835s][info ][class,load   ] io.perfmark.impl.Generator 
[1.835s][debug][class,resolve] io.perfmark.impl.Generator java.lang.Object (super)
[1.836s][debug][class,resolve] java.lang.Class io.perfmark.impl.Generator (explicit)
[1.836s][debug][class,resolve]
  io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl io.perfmark.impl.Generator (verification)
[1.837s][info ][class,load   ] io.perfmark.impl.NoopGenerator 
[1.837s][debug][class,resolve] io.perfmark.impl.NoopGenerator io.perfmark.impl.Generator (super)
[1.837s][debug][class,resolve] java.lang.Class io.perfmark.impl.NoopGenerator (explicit)
[1.838s][info ][class,load   ]
  Found Noop: io.perfmark.impl.NoopGenerator
[1.838s][debug][class,resolve] 
  io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl io.perfmark.impl.NoopGenerator (verification)
[1.838s][debug][class,resolve]
 io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl java.lang.Object (verification)
[1.838s][debug][class,resolve]
 io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl java.lang.Object (verification)
[1.847s][debug][class,resolve]
 io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl java.util.concurrent.atomic.AtomicLong
[1.848s][info ][class,load   ]
 io.perfmark.java7.SecretMethodHandleGenerator$MethodHandleGenerator 

While the JVM is well documented, the documentation is hard to penetrate for someone who isn't writing their own implementation. What we want to know is why this class is relevant to verification. Rather than go over the Specification with a fine-tooth comb, let's just put a breakpoint into the JVM itself!

Instrumenting the JVM

First, let’s get a copy:

git clone https://github.com/openjdk/jdk.git

After fumbling with the configuration arguments, let’s try out a slowdebug build.

bash configure \
  --with-boot-jdk=/home/carl/.gradle/jdks/adoptium-19-x64-hotspot-linux/jdk-19.0.1+10/ \
  --with-debug-level=slowdebug  \
  --enable-headless-only \
  --with-native-debug-symbols=internal

I (ab)used the JDK that Gradle downloaded for me, but the rest of the configuration is pretty standard. I am using slowdebug and --with-native-debug-symbols because for some reason GDB was unable to find the function names in the backtrace. I used --enable-headless-only because I don't have all the header files locally. Okay, let's build!

CONF=linux-x86_64-server-slowdebug make

This takes about 8 minutes on my Skylake processor. Soon enough, we have a fully functional JDK. Because this is a hack, I manually modified the java command that Gradle builds for me to call GDB:

-   exec "$JAVACMD" "$@"
+   exec gdb --args "$JAVACMD" "$@"

Finally, let’s see that it works:

$ JAVA_HOME=~/git/jdk/build/linux-x86_64-server-slowdebug/jdk/ \
  ./build/install/perfmark-examples/bin/perfmark-examples

GNU gdb (Debian 12.1-4+b1) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
    .

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/carl/git/jdk/build/linux-x86_64-server-slowdebug/jdk//bin/java...
(gdb) 

Alright, we can boot our JVM with GDB, and we are ready to see what it's doing. However, we need a way to see when the NoopGenerator class really gets loaded. After walking through the code, I found that verification happens in a file called verifier.cpp. Let's add a print in Verifier::trace_class_resolution:

void Verifier::trace_class_resolution(
    Klass* resolve_class, InstanceKlass* verify_class) {
  assert(verify_class != nullptr, "Unexpected null verify_class");
  ResourceMark rm;
  Symbol* s = verify_class->source_file_name();
  const char* source_file = (s != nullptr ? s->as_C_string() : nullptr);
  const char* verify = verify_class->external_name();
  const char* resolve = resolve_class->external_name();

  if (strstr(resolve, "NoopGenerator") != nullptr) {
    log_info(class, load)("Found NoopGenerator: %s", resolve);
  }

  // print in a single call to reduce interleaving between threads
  if (source_file != nullptr) {
    log_debug(class, resolve)(
      "%s %s %s (verification)", verify, resolve, source_file);
  } else {
    log_debug(class, resolve)("%s %s (verification)", verify, resolve);
  }
}

As of this writing, this happens on line 129. Let’s rebuild and re-run:

CONF=linux-x86_64-server-slowdebug make
$ JAVA_HOME=~/git/jdk/build/linux-x86_64-server-slowdebug/jdk/ \
  ./build/install/perfmark-examples/bin/perfmark-examples

(gdb) 

At this point, we need to make GDB not pester us as we are stepping through. The JVM uses segmentation faults to implement efficient NullPointerException checks, so we want to avoid being notified of those. It also uses other signals (e.g. SIGUSR2) for thread pausing and resuming, which we aren't interested in:

Reading symbols from /home/carl/git/jdk/build/linux-x86_64-server-slowdebug/jdk//bin/java...
(gdb) handle SIGSEGV noprint nostop
Signal        Stop  Print   Pass to program Description
SIGSEGV       No    No  Yes     Segmentation fault
(gdb) handle SIGUSR2 noprint nostop
Signal        Stop  Print   Pass to program Description
SIGUSR2       No    No  Yes     User defined signal 2
(gdb) 

Okay, let’s insert a breakpoint:

(gdb) break verifier.cpp:129
No source file named verifier.cpp.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (verifier.cpp:129) pending.
(gdb) 

Alright, let’s run!

(gdb) run
Starting program: /home/carl/git/jdk/build/linux-x86_64-server-slowdebug/jdk/bin/java
//   .... Lots of output
Thread 2 "java" hit Breakpoint 1, Verifier::trace_class_resolution 
    (resolve_class=0x8000c44d8, verify_class=0x8000c4000) 
    at /home/carl/git/jdk/src/hotspot/share/classfile/verifier.cpp:129
129     log_info(class, load)("Found NoopGenerator: %s", resolve);
(gdb) 

Now that we have hit our breakpoint, let's see how the class loader actually got here.

(gdb) bt
#0  Verifier::trace_class_resolution (resolve_class=0x8000c44d8, verify_class=0x8000c4000) at /home/carl/git/jdk/src/hotspot/share/classfile/verifier.cpp:129
#1  0x00007ffff6c99e82 in VerificationType::resolve_and_check_assignability (klass=0x8000c4000, name=0x7ffff05705c8, from_name=0x7ffff0570d48, from_field_is_protected=false, from_is_array=false, 
    from_is_object=true, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/classfile/verificationType.cpp:85
#2  0x00007ffff6c9a055 in VerificationType::is_reference_assignable_from (this=0x7ffff571b5a0, from=..., context=0x7ffff571c870, from_field_is_protected=false, __the_thread__=0x7ffff0032830)
    at /home/carl/git/jdk/src/hotspot/share/classfile/verificationType.cpp:122
#3  0x00007ffff6b1827f in VerificationType::is_assignable_from (this=0x7ffff571b5a0, from=..., context=0x7ffff571c870, from_field_is_protected=false, __the_thread__=0x7ffff0032830)
    at /home/carl/git/jdk/src/hotspot/share/classfile/verificationType.hpp:289
#4  0x00007ffff6cad7ad in StackMapFrame::pop_stack (this=0x7ffff571bdf0, type=..., __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/classfile/stackMapFrame.hpp:236
#5  0x00007ffff6ca8321 in ClassVerifier::verify_field_instructions (this=0x7ffff571c870, bcs=0x7ffff571bd80, current_frame=0x7ffff571bdf0, cp=..., allow_arrays=true, __the_thread__=0x7ffff0032830)
    at /home/carl/git/jdk/src/hotspot/share/classfile/verifier.cpp:2367
#6  0x00007ffff6ca439a in ClassVerifier::verify_method (this=0x7ffff571c870, m=..., __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/classfile/verifier.cpp:1693
#7  0x00007ffff6c9d1c6 in ClassVerifier::verify_class (this=0x7ffff571c870, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/classfile/verifier.cpp:645
#8  0x00007ffff6c9b499 in Verifier::verify (klass=0x8000c4000, should_verify_class=true, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/classfile/verifier.cpp:201
#9  0x00007ffff6444437 in InstanceKlass::verify_code (this=0x8000c4000, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/oops/instanceKlass.cpp:752
#10 0x00007ffff6444a91 in InstanceKlass::link_class_impl (this=0x8000c4000, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/oops/instanceKlass.cpp:873
#11 0x00007ffff64444c2 in InstanceKlass::link_class (this=0x8000c4000, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/oops/instanceKlass.cpp:758
#12 0x00007ffff644539d in InstanceKlass::initialize_impl (this=0x8000c4000, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/oops/instanceKlass.cpp:1027
#13 0x00007ffff64443a2 in InstanceKlass::initialize (this=0x8000c4000, __the_thread__=0x7ffff0032830) at /home/carl/git/jdk/src/hotspot/share/oops/instanceKlass.cpp:740
#14 0x00007ffff659bc52 in find_class_from_class_loader (env=0x7ffff0032b50, name=0x7ffff0572b78, init=1 '\001', loader=..., protection_domain=..., throwError=0 '\000', __the_thread__=0x7ffff0032830)
    at /home/carl/git/jdk/src/hotspot/share/prims/jvm.cpp:3537
#15 0x00007ffff658c395 in JVM_FindClassFromCaller (env=0x7ffff0032b50, name=0x7ffff571edd0 "io/perfmark/impl/SecretPerfMarkImpl$PerfMarkImpl", init=1 '\001', loader=0x7ffff571ef08, caller=0x7ffff571ef00)
    at /home/carl/git/jdk/src/hotspot/share/prims/jvm.cpp:825
#16 0x00007ffff54bff29 in Java_java_lang_Class_forName0 (env=0x7ffff0032b50, this=0x7ffff571eef0, classname=0x7ffff571ef18, initialize=1 '\001', loader=0x7ffff571ef08, caller=0x7ffff571ef00)
    at /home/carl/git/jdk/src/java.base/share/native/libjava/Class.c:145
#17 0x00007fffe855aaad in ?? ()
#18 0x0000000000000002 in ?? ()
#19 0x00007fffbc187288 in ?? ()
#20 0x0000000000000000 in ?? ()

The relevant frames are the Verifier and VerificationType calls near the top of the stack. It seems like the JVM is keen on making sure that NoopGenerator is assignable to Generator, in case that line is ever executed. With this in mind, we can guess how to fix it.
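
Before moving on, we can sanity check that this is general verifier behavior and not something unique to PerfMark. Below is a minimal sketch with made-up class names (Base, Fallback, MainImpl, Holder, and Repro are all hypothetical); running it with -Xlog:class+load=info should show Fallback being loaded during Holder's initialization, even though the branch that creates it never runs.

// Repro.java -- a hypothetical, stripped-down version of the PerfMark pattern.
abstract class Base {}

final class MainImpl extends Base {}

final class Fallback extends Base {}

class Holder {
  static final Base BASE;

  static {
    Base b = null;
    try {
      // Mirrors the Class.forName bootstrap above.
      b = (Base) Class.forName("MainImpl").getConstructor().newInstance();
    } catch (Throwable t) {
      // Ignore and fall through to the fallback.
    }
    if (b != null) {
      BASE = b;
    } else {
      // Never executed when MainImpl is present, yet the verifier still
      // resolves Fallback here to prove the putstatic below is type safe.
      BASE = new Fallback();
    }
  }
}

public final class Repro {
  public static void main(String[] args) {
    // Run with: java -Xlog:class+load=info Repro
    System.out.println(Holder.BASE.getClass());
  }
}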

Cheating the Verifier

So far, the class hierarchy looks something like:

public abstract class Generator {
  // ...
}
final class NoopGenerator extends Generator {
  // ...
}

Recall the original function that triggers this behavior:

// SecretPerfMarkImpl.PerfMarkImpl

static {
  Generator gen = ...
  if (gen != null) {
    generator = gen;
  } else {
    generator = new NoopGenerator();
  }
  // More Initialization code.
}

The JVM wants to be sure that if NoopGenerator is ever assigned, it really is a Generator. The problem is that it performs this verification eagerly, even though that branch is rarely taken. With this in mind, we can come up with a solution.

Warning: Hacks

Do not write the following code without a pretty beefy comment:

static {
  Generator gen = ...
  if (gen != null) {
    generator = gen;
  } else {
    generator = (Generator) (Object) new NoopGenerator();
  }
  // More Initialization code.
}

This double cast (Object masking?) forces the compiler to emit a checkcast instruction, so the bytecode no longer vouches for the type of the assignment on its own. Note the change in the bytecode:

       176: aload_0
       177: ifnonnull     196
       180: new           #3                  // class io/perfmark/impl/NoopGenerator
       183: dup
       184: invokespecial #4                  // Method io/perfmark/impl/NoopGenerator."<init>":()V
       187: checkcast     #2                  // class io/perfmark/impl/Generator
       190: putstatic     #26                 // Field generator:Lio/perfmark/impl/Generator;
       193: goto          200
       196: aload_0
       197: putstatic     #26                 // Field generator:Lio/perfmark/impl/Generator;
       200: iconst_0
       201: istore_2
       202: iconst_0
       203: istore_3
       204: ldc           #108                // String io.perfmark.PerfMark.startEnabled
       206: invokestatic  #109                // Method java/lang/Boolean.getBoolean:(Ljava/lang/String;)Z

This minor change defers the type check until runtime, so the verifier no longer needs to resolve NoopGenerator up front. I won't re-paste all the debug output, but let's confirm the change had an effect. We will use JMH with the classloader profiler to measure the changes:
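
The benchmark is roughly the following shape. This is a sketch, not the actual PerfMark benchmark: the class and package names here are hypothetical, and the real fork/iteration settings may differ. The important parts are single-shot mode (so class loading isn't amortized away) and JMH's classloader profiler, enabled with -prof cl, which produces the class.load counters below.

// ClassInitBenchmark.java -- hypothetical sketch of the measurement.
package io.perfmark.examples;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class ClassInitBenchmark {

  @Benchmark
  @BenchmarkMode(Mode.SingleShotTime)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  @Fork(400) // one cold measurement per fresh JVM; the fork count is a guess
  public Class<?> forName_init() throws Exception {
    // Initializing the class runs the static initializer, which is where
    // the accidental NoopGenerator load happens.
    return Class.forName(
        "io.perfmark.impl.SecretPerfMarkImpl$PerfMarkImpl",
        /* initialize= */ true,
        ClassInitBenchmark.class.getClassLoader());
  }
}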

BEFORE
Benchmark                          Mode  Cnt     Score     Error        Units
forName_init                         ss  400  8730.958 ± 142.833        us/op
forName_init:·class.load             ss  400     1.166 ±   0.015  classes/sec
forName_init:·class.load.norm        ss  400    72.105 ±   0.112   classes/op
forName_init:·class.unload           ss  400       ≈ 0            classes/sec
forName_init:·class.unload.norm      ss  400       ≈ 0             classes/op

AFTER
Benchmark                          Mode  Cnt     Score     Error        Units
forName_init                         ss  400  8411.036 ± 148.305        us/op
forName_init:·class.load             ss  400     1.136 ±   0.016  classes/sec
forName_init:·class.load.norm        ss  400    71.040 ±   0.113   classes/op
forName_init:·class.unload           ss  400       ≈ 0            classes/sec
forName_init:·class.unload.norm      ss  400       ≈ 0             classes/op

As we can see, the number of classes loaded per operation drops by about one. We won't read too much into the speedup, since the error bars are already pretty large.

Conclusion

The JVM keeps us safe by making sure our classes are sound, but sometimes we want to defer those checks until later. This post shows how to diagnose such cases, and how to avoid unnecessary class loads.


Becoming an Extrovert

In 2022, I made the painful transition from introvert to extrovert. It has greatly improved my life by making me feel more confident and comfortable in social settings. While the fear of talking to new people still lingers on, I can now override that little voice inside. Several people have expressed interest in my conversion, and I wanted to document how I changed. If you are looking for an avenue for self improvement that is challenging and rewarding, I’d recommend giving this a try.

Introverts

As the old joke goes, introverts look at their shoes when talking, but extroverts look at your shoes. I originally thought the truth was along these lines: introverts don't talk to other people, but extroverts do. When I talked to people about it, though, many claimed to be introverts, even though they're social and engaging.

Introverts do not get energy from talking to other people. It's not black and white, but this is the litmus test for whether you are introverted. Prior to 2022, I was like this. However, it manifested in less obvious ways. For example:

A superficial understanding would be that these sentiments are about not wanting to talk with others. However, a more introspective survey would show (at least for me) that these are really insecurities and confidence issues. The surface-level logic makes sense, and no one can really challenge it (e.g. you might actually be interrupting someone) at face value. I would rebut that it's more about fear of rejection.

If you want to:

Then, being an introvert will not serve you. This is the conclusion I came to at the beginning of 2022. I decided that the awkward and cringy experiences would ultimately be worth it for my own personal development. I needed to venture from one side of the social chasm to the other.

Extroverts

Extroverts absorb energy from talking to other people. Not only do they get ramped up talking to and learning about others, but every moment in the conversation makes them want it more. Have you ever felt drained after attending a social obligation? Have you ever thought "Uggh, I've been out talking to people all day, I need a break"? Extroverts do not feel this way.

I can see the internal changes inside myself. It’s a mindset thing; the first thought when seeing other people changes. Notice the difference in narrative:

If that last one bothers you, don't let it! My own pendulum had swung far toward the introversion side before I started my journey. In order to push myself to feel comfortable, I had to override my own internal monologue and be bold enough to talk to others. Remember, it's a free country! It's not illegal to talk to the human beings around you. The extrovert thinks: "if they don't want to talk to me, they will let me know [via verbal or body language cues]".

One of the major downsides I noticed of being extroverted is that we (they) feel a constant desire to talk to others. Some people don't want to talk, but we still have an unfulfilled need. As a result, extroverts face rejection every day. They have to talk to others; their nature compels them to initiate. Would you rather live in a world where you never talk to anyone, or a world where no one wants to talk to you? I think the extrovert scenario is the more painful one!

How to Make the Leap

Several mindset changes were needed for me to make the leap. I won't sugar-coat it: it's going to be a painful and possibly embarrassing experience. Here is what I internalized:

I had (and maybe still have?) a bad habit of trying to follow up on everything someone else says. It left no room in the dialogue for the other person to open up: to be heard. It's painful to sit through that silence, but I've noticed other people will reveal more of themselves if you leave them room. And remember, they hate that silence too!

Results

I took a Myers Briggs test at the beginning of 2022, and then again around August. I answered honestly how I really felt and what I would do. The test picked up the change from IXXX to EXXX which was kind of amazing to me.

Additionally, my confidence in myself has risen dramatically. Other people have noticed as well. The change isn't fully complete, but it's far enough along that I can see what is needed to finish.

Confidence and comfort talking to others is held inside like a sieve: it leaks, and it gets refilled constantly by talking to others, which is how extroverts rally. I feel like I can last longer and longer in conversation, and at some point I won't be tired by it at all. A friend of mine said he could talk all day long (like 8 hours) and hunger for more at the end of it. This is the level I am going for, and the changes in attitude I listed above are what will get me there.


Why Does gRPC Insist on Trailers?

gRPC comes up occasionally on the Orange Site, often with a redress of grievances in the comment section. One of the major complaints people have with gRPC is that it requires HTTP trailers. This one misstep has caused so much heartache and trouble that I think it is probably the reason gRPC failed to achieve its goal. Since I was closely involved with the project, I wanted to rebut some misconceptions I see posted a lot, and warn future protocol designers against the mistakes we made.

Mini History of gRPC’s Origin.

gRPC was reared by two parents trying to solve similar problems:

  1. The Stubby team. They had just begun the next iteration of their RPC system, used almost exclusively throughout Google. It handled 10^10 queries per second in 2015. Performance was a key concern.
  2. The API team. This team owned the common infrastructure serving (all) public APIs at Google. The primary value-add was converting REST+JSON calls to Stubby+Protobuf. Performance was a key concern.

The push to Cloud was coming on strong from the top, and the two teams joined forces to ease communication from the outside world to the inside. Rather than boil the ocean, they decided to reuse the newly minted HTTP/2 protocol. Additionally, they chose to keep Protobuf as the default wire format, but allow other encodings too. Stubby had tightly coupled the message, the protocol format, and custom extensions, making it impossible to open source just the protocol.

Thus, gRPC would allow intercommunication between browsers, phones, servers, and proxies, all using HTTP semantics, and without forcing the entirety of Google to change message formats. Since message translation is no longer needed, high-speed communication between endpoints is tractable.

HTTP, HTTP/1.1, and HTTP/2

HTTP is about semantics: headers, messages, and verbs.
HTTP/1.1 is a mix of a wire format plus those semantics (RFCs 7231-7239). gRPC tries to keep the HTTP semantics while upgrading the wire format. Around 2014-15, SPDY was being tested by Chrome and GFE as a workaround for problems with HTTP/1.1. Specifically:

Acting on the promising improvements seen in the SPDY experimentation, the protocol was formalized into HTTP/2. HTTP/2 only changes the wire format, but keeps the HTTP semantics. This allows newer devices to downgrade the wire format when speaking with older devices.

As an aside, HTTP/2 is technically superior to WebSockets. HTTP/2 keeps the semantics of the web, while WS does not. Additionally, WebSockets suffers from the same head-of-line blocking problem HTTP/1.1 does.

Those Contemptible Trailers

Most people do not know this, but HTTP has had trailers in the specification since 1.1. The reason they are so uncommonly used is that most user agents don't implement them, and don't surface them to the JS layer.
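
For reference, here is a sketch of what a trailer looks like in an HTTP/1.1 chunked response (the field name Status-Detail is made up for illustration). The Trailer header announces it, and the trailing field arrives after the final zero-length chunk:

HTTP/1.1 200 OK
Server: example.com
Transfer-Encoding: chunked
Trailer: Status-Detail

6
abc123
0
Status-Detail: complete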

Several events happened around the same time, which led to the bet on requiring trailers:

The thinking went like this:

  1. Since we are using a new protocol, any devices that use it will need to upgrade their code.
  2. When they upgrade their code, they will need to implement trailer support anyways.
  3. Since HTTP/2 is effectively always deployed over TLS, middleboxes can't inspect the traffic and are unlikely to choke on unexpected trailers.

Why Do We Need Trailers At All?

So far, we’ve only talked about if it’s possible to use trailers, not if we should use them? It’s been over two decades, and we haven’t needed them yet, why put such a big risk into the gRPC protocol?

The answer is that it solves an ambiguity. Consider the following HTTP conversation:

GET /data HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Server: example.com

abc123

In this flow, what was the length of the /data resource? Since we don’t have a Content-Length, we are not sure the entire response came back. If the connection was closed, does it mean it succeeded or failed? We aren’t sure.

Since streaming is a primary feature of gRPC, we often will not know the length of the response ahead of time. HTTP aficionados are probably feeling pretty smug right now: "Why don't you use Transfer-Encoding: chunked?" This too is insufficient, because errors can happen late in the response cycle. Consider this exchange:

GET /data HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Server: example.com
Transfer-Encoding: chunked

6
abc123
0

Suppose that the server was in the middle of streaming a chat room message back to us, and there is a reverse proxy between our user agent and the server. The server sends chunks back to us, but after sending the first chunk of 6, the server crashes. What should the proxy send back to us? It's too late to change the response code from 200 to 503. If there were pipelined requests, all of them would need to be thrown away too. And if the proxy wanted to keep the connection open (remember, connections cost a lot to establish), it would not want to terminate it over an arguably recoverable error.

Hopefully this illustrates the ambiguity between successful, complete responses, and a mic-drop. What we need is a clear sign the response is done, or a clear sign there was an error.

Trailers are this final word, where the server can indicate success or failure in an unambiguous way.
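
In gRPC's case, that final word is the grpc-status trailer. Simplified (actual HTTP/2 framing and the length-prefixed message format omitted), a successful response looks roughly like this on the wire:

HEADERS (response headers)
  :status: 200
  content-type: application/grpc

DATA (one or more length-prefixed messages)

HEADERS (trailers, END_STREAM)
  grpc-status: 0

A failure late in the stream can still be reported cleanly, for example grpc-status: 14 (UNAVAILABLE), even though the :status was already sent as 200; and if the trailers never arrive at all, the client knows the response was cut short.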

Trailers for JSON vs. Protobuf

While gRPC is definitely not Protobuf specific, it was created by people who had been burned by Protobuf's encoding. The encoding of Protobuf probably had a hand in the need for trailers, because it's not obvious when a Proto is finished. Protobuf messages are a concatenation of Key-Length-Values. Because of this structure, it's possible to concatenate two Protos together and still have a valid message. The downside of this is that there is no obvious point at which the message is complete. An example of the problem:

syntax = "proto3";
message DeleteRequest {
   string id = 1;
   int32 limit = 2;
}

The wire format for an example message looks like:

Field 1: "zxy987"
Field 2: 1

A program can override a value by adding another field on:

Field 2: 1000

The concatenation would be:

Field 1: "zxy987"
Field 2: 1
Field 2: 1000

Which would be interpreted as:

Field 1: "zxy987"
Field 2: 1000

This makes encoding messages faster, since there is no size field at the beginning of the message. However, there is now a (mis-)feature where Protos can be split or joined along KLV boundaries.
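
To see this concretely, here is a sketch in Java. It assumes the DeleteRequest proto above has been compiled with protoc's Java plugin and that the generated class is on the classpath; the class and method names follow protobuf-java's usual conventions but are assumptions here:

// ConcatDemo.java -- sketch only; DeleteRequest is the generated protobuf class.
import java.io.ByteArrayOutputStream;

public final class ConcatDemo {
  public static void main(String[] args) throws Exception {
    byte[] first = DeleteRequest.newBuilder()
        .setId("zxy987")
        .setLimit(1)
        .build()
        .toByteArray();

    byte[] second = DeleteRequest.newBuilder()
        .setLimit(1000)
        .build()
        .toByteArray();

    // Splice the two serialized messages together along the KLV boundary.
    ByteArrayOutputStream joined = new ByteArrayOutputStream();
    joined.write(first);
    joined.write(second);

    // Parsing the concatenation is perfectly legal; the receiver cannot tell
    // the bytes were ever two separate messages, and the later scalar wins.
    DeleteRequest merged = DeleteRequest.parseFrom(joined.toByteArray());
    System.out.println(merged.getId());     // zxy987
    System.out.println(merged.getLimit());  // 1000
  }
}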

JSON has the upper hand here. With JSON, the message has to end with a closing curly brace }. If we haven't seen the final curly and the connection hangs up, we know something bad has happened. JSON is self-delimiting, while Protobuf is not. It's not hard to imagine that trailers would have been less of an issue if the default encoding had been JSON.

The Final Nail in gRPC’s Trailers

Trailers were officially added to the fetch API, and all major browsers said they would support them. The authors were part of the WHATWG, and worked at the companies that could actually put them into practice. However, Google is not one single company, but a collection of independent and distrusting companies. While the point of this post is not to point fingers, a single engineer on the Chrome team decided that trailers should not be surfaced up to the JS layer. You can read the arguments against it, but the short version is that there was some fear around semantic differences causing security problems. For example, if a Cache-Control header appears in the trailers, does it override the one in the headers?

I personally found this reason weak, and offered a compromise of treating them as semantic-less key-values surfaced up to the fetch layer. Whether it’s because I was wrong, or failed to make the argument, I strongly suspect organizational boundaries had a substantial effect. The Area Tech Leads of Cloud also failed to convince their peers in Chrome, and as a result, trailers were ripped out.

Lessons for Designers

This post hopefully explained why trailers were included, and why they ultimately didn't work out. I left the gRPC team in 2019, but I still think fondly of what we created. There are gobs of things the team got right; unfortunately, this one mistake ended up being its demise. Some takeaways:




You can find me on Twitter @CarlMastrangelo