Carl

Carl Mastrangelo

A programming and hobby blog.


Health Checking Best Practices

When it’s time to productionize your service, you will need to consider how to tell if your service is running. In this post, we’ll look at how Health Checking works, and some of the possible trade offs.

What is a Health Check?

Before we go too deep let’s first talk about what a health check is and is not. I claim:

A health check is a single probe to a service that tells whether the service can handle requests.

This can be as simple as opening a TCP socket to the server, or it could be an API request. It could even be just checking if the process is running. Health checks solve a number of systemic problems:

Health Checks and Keep-Alives

These two seemingly similar concepts are often used interchangeably, but they actually serve different purposes. A health check tells whether the service can handle requests, while a keep-alive tells whether a particular client is still connected to a server. Keep-alives solve a number of network problems:

As we can see, the difference is that keep-alives are scoped to the connection, while health checks are system-wide. Generally, you want to have both. A future post will talk about how to configure keep-alives.

Probing

While not the subject of this post, probing is an important part of overall system stability. Like health checking, it too tells if your service is up. However, probing is usually done across all instances of your service, rather than to a single instance. In other words, probes are issued to a load-balanced target, either region-wide or globally. Additionally, the response to a failed probe is different than a failed health check. A failed probe will trigger alerting and notify someone, while a failed health check may just restart a server. Lastly, probes are either white-box (checks the response matches) or black-box (any response is good). Health checks are always black-box.

Option 1: Point to Point Health Checks

In the simplest case, a health check is a keep-alive. Your client can send a request to your server periodically to see if it’s still willing to take more traffic. If the server ever becomes unhealthy, the client can close the connection, and try connecting to a different server.

health1.png

Point to point health checks let the client query the server to see if it should still receive traffic. This includes all interested parties, such as your Kubernetes container. Your server exposes a “health checking” end point (such as a gRPC service defined in your .proto file), which can be called by anyone. For example, in Kubernetes this is an HTTP GET request to “/healthz”.

Pros

Cons

The problems with point to point health checking are more obvious as your service is under higher load, and not at all obvious when you start out. This is the scary part, since you don’t know about the problems until it’s too late, and you are in the middle of a crisis.

Minor Cons

There are a few other issues that depend on how point to point health checks are implemented. The following points may not apply, but they are worth noting:

Option 2: Centralized Health Checks

To get around some of the problems with the Point to Point model, we can define a centralized health checking service. Rather than clients asking servers if the server is healthy, a single service can query each server. The health checking service passes this information to the load balancer, which can then decide to remove servers from the pool, or add them back when they are ready. At startup, clients query the load balancer to get a list of healthy servers. Because the unhealthy servers won’t be present, clients avoid connecting to servers that don’t want traffic.

health2.png

While operationally more complex, this does solve many of the above problems. Let’s list them:

Pros

Cons

Variations

If your architecture does not include a separate load balancer service, it is possible to have clients query the central health checker service directly. This still has some of the benefits of offloading work from the servers, and limitting access. The client can combine the health data with the list of servers it knows about to decide who to connect to.

It may also be possible to report the server healthiness directly to the load balancer, rather than have a separate health checker service. This avoids the operation complexity of an additional binary, in favor of increasing the load balancer responsibility.

Make Health Checks Look Like Real Requests

Regardless of which option you go with, strongly consider using a real, idempotent, request to check health. Doing so raises the fidelity of the response, because it exercises the same code paths a normal request would.

If making a custom request is difficult, consider making the server self issue a real request. For example, if our key-value service gets a health check request, it could issue a lookup request back to itself to see if it is healthy. Thus, each server is also its own client in a way. Do make sure if you take this approach, to propagate the original requestors identity and credentials, so that your server doesn’t escalate the privilege of the health check.

Always make sure to use authenticated requests. If normal client traffic has to include OAuth tokens to make queries, then the health checks should include them too. Despite our best efforts, sometimes requests do have side effects. This is a security risk that you don’t need. Additionally, if the auth checks fail, it means your clients will likely get failure responses too.

Consider using gRPC!

gRPC has support for a standardized central health checker. It includes a protobuf definition which can be used. gRPC also has full support for keep-alives, so you can get both at the same time.

Conclusion

Health checking is tricky to do correctly, but can be tamed by having the right service setup. Using a centralized health checking service with service-specific health checks provides the most useful, stable, healthiness data.


Home

You can find me on Twitter @CarlMastrangelo