Kubernetes probes are widely adopted but rarely fully understood, and that gap can cause unintentional outages. Readiness, liveness, and startup probes for containers are broadly familiar across the industry and used throughout most organizations, yet deep technical understanding of how they actually behave is often limited.

The Kubernetes community did a particularly good job of marketing these probes; they exist in almost every open-source application running on K8s and in most of the templates and samples found around the internet. As a result, the probes have come to be viewed as “health checks” and have been widely adopted without most users having a good understanding of what they are actually doing under the hood.
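For context, the three probes drive very different behavior: a failing readiness probe removes the Pod from Service load balancing, a failing liveness probe causes the kubelet to restart the container, and a startup probe suppresses the other two until a slow-starting application has booted. Below is a minimal, hedged sketch of how they are typically declared, expressed here with the fabric8 Kubernetes client builders purely for illustration; the same fields appear directly in a Pod's YAML spec, and all paths, ports, and timings are made up.

import io.fabric8.kubernetes.api.model.Container;
import io.fabric8.kubernetes.api.model.ContainerBuilder;
import io.fabric8.kubernetes.api.model.IntOrString;
import io.fabric8.kubernetes.api.model.Probe;
import io.fabric8.kubernetes.api.model.ProbeBuilder;

public class ProbeExample {
    public static void main(String[] args) {
        // Readiness: a failing check removes the Pod from the Service's endpoints
        // (it stops receiving traffic), but does NOT restart the container.
        Probe readiness = new ProbeBuilder()
                .withNewHttpGet().withPath("/ready").withPort(new IntOrString(8080)).endHttpGet()
                .withPeriodSeconds(5)
                .withTimeoutSeconds(1)
                .withFailureThreshold(3)
                .build();

        // Liveness: a failing check causes the kubelet to restart the container.
        Probe liveness = new ProbeBuilder()
                .withNewHttpGet().withPath("/live").withPort(new IntOrString(8080)).endHttpGet()
                .withPeriodSeconds(10)
                .withTimeoutSeconds(1)
                .withFailureThreshold(3)
                .build();

        // Startup: while this check has not yet succeeded (for up to
        // failureThreshold * periodSeconds), readiness and liveness probes are
        // not run, giving slow-starting applications time to boot.
        Probe startup = new ProbeBuilder()
                .withNewHttpGet().withPath("/live").withPort(new IntOrString(8080)).endHttpGet()
                .withPeriodSeconds(10)
                .withFailureThreshold(30)
                .build();

        Container container = new ContainerBuilder()
                .withName("app")
                .withImage("example/app:latest")
                .withReadinessProbe(readiness)
                .withLivenessProbe(liveness)
                .withStartupProbe(startup)
                .build();
        System.out.println(container);
    }
}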

DoorDash is no exception. Having adopted K8s broadly over the last year and a half, we incorporated these “health checks” into almost every one of our applications without giving much thought to whether we needed those particular probes, or to what the “health check” endpoint used by each application actually does under the hood. After a particularly bad outage on Black Friday, we learned that health checks can cause serious problems if not implemented correctly, which is why we decided to share our knowledge with the community in this post. First, we will talk about the outage in question and how things went wrong, and then we will share the actionable steps we took to prevent this kind of problem in the future.

Our health checks outage on Black Friday

Because the DoorDash team lacked a deep technical understanding of the health checks we were running, we experienced an informative incident on Black Friday, a typically busy holiday for DoorDash. 

Toward the end of the day, our engineers received various alerts that our Tier 0 service was experiencing issues. Specifically, we saw:

  • CPU utilization spiking
  • Response latencies increasing
  • SLOs burning
  • Reports coming in from other services with failing dependencies

Our incident management tooling allowed us to quickly assemble the incident response team with the relevant counterparts and start diagnosing the trigger so we could mitigate the impact.

We were able to assemble an approximate diagnosis and timeline:

  • A large number of Pods were failing their readiness probes and were removed from the Kubernetes Service
  • The remaining Pods were quickly overwhelmed as they were forced to handle the majority of requests, sending CPU utilization skyrocketing

To mitigate the initial impact, we disabled the readiness checks on the Pods, and service functionality was restored.

Understanding what failed

After the service was back to serving traffic normally, we had the opportunity to look into what exactly happened and what action items needed to be completed to avoid the problem in the future. Typically when dealing with an outage of this nature, it’s important to look at metrics, traces, and logs. In this section, we will outline our examination of each of these.

Looking at metrics to narrow down the search

Generally, the first place we look during an outage is our metrics. Metrics tend to provide data on what is failing and to what extent, e.g., a particular endpoint returning 503 error codes 90% of the time. For this particular outage, our metrics indicated only an overall increase in latency across all endpoints and failing Kubernetes readiness checks, which did not narrow things down to a specific failure. Given that the metrics were not providing much insight, the next step was to take a look at traces.

Using traces to track down individual application requests

After narrowing down the failure to a particular endpoint or determining that the metrics were not helpful, the next step is looking at the traces. Traces provide in-depth information on what functions were executed during a single request. In comparison, checking logs is often more difficult because they can be poorly formatted, and it can be very challenging to find issues without knowing exactly what to look for. 

To analyze our Black Friday incident further, we looked at our tracing data. What we found was that health check endpoints were excluded from the reported tracing data. This meant that tracing was not going to help us find what caused the health checks to fail, and checking the logs was the next logical step.

How we used logs to find what happened

Since we were not able to find the cause of the health check failures with metrics or tracing data, we turned our attention to logs. From looking at the traces, we already knew that health check endpoints were also excluded from application logs, making them less useful in this case. However, we also had logs from our eBPF agent, a piece of software that runs adjacent to all our services and collects data on all TCP, UDP, and DNS requests performed by the service.

From these logs, we found increased latency to one of our Redis servers, and a drop in request volume to that server once we disabled the readiness checks. Importantly, though, the Redis server with the latency increase sat on a legacy path slated for removal and should not have impacted our application. It turned out that the endpoint used in our readiness check was the default health check provided by the Spring Boot framework. That default health check endpoint performs a number of smaller, configurable health checks that are enabled by default, one of them being a Redis check.
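In other words, the culprit was the aggregate health endpoint: when a Redis connection factory is on the classpath, Spring Boot auto-configures a Redis health contributor and includes it in /actuator/health by default. The snippet below is a simplified, illustrative sketch of what such a contributor amounts to, not the framework's actual code; the point is that a slow or unhealthy Redis makes the aggregate endpoint slow or unhealthy, even when the request-serving code is fine.

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.connection.RedisConnection;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.stereotype.Component;

// Simplified sketch of a Redis-backed health contributor. Spring Boot auto-configures
// something comparable when Redis is present; this is NOT the framework's actual code.
@Component
public class RedisPingHealthIndicator implements HealthIndicator {

    private final RedisConnectionFactory connectionFactory;

    public RedisPingHealthIndicator(RedisConnectionFactory connectionFactory) {
        this.connectionFactory = connectionFactory;
    }

    @Override
    public Health health() {
        // A slow or unreachable Redis makes this call, and therefore the aggregate
        // /actuator/health response, slow as well; with a 1s probe timeout, the
        // readiness check fails even though the request-serving path may be fine.
        RedisConnection connection = connectionFactory.getConnection();
        try {
            connection.ping();
            return Health.up().build();
        } catch (Exception ex) {
            return Health.down(ex).build();
        } finally {
            connection.close();
        }
    }
}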

Testing our outage theory

After forming a hypothesis about the origin of the failure, we needed to confirm the theory and, later, verify the fix once it was implemented. To test our theory, we used Chaos Engineering, which allows failure to be injected into various aspects of a system in order to proactively find weaknesses that can negatively impact it. In this case, we used Litmus Chaos, an open-source Chaos Engineering platform that enables the injection of various failures through targeted experiments. We configured an experiment called Pod Network Latency, which added one second of latency to all calls to the Redis server previously identified as the source of the health check failures. One second was chosen because our readiness check timeout was set to the same value. With the experiment enabled, we saw readiness checks start to fail in a similar manner as during the outage.
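Litmus drives this kind of experiment through Kubernetes custom resources, so there is no application code to show for the experiment itself. As a rough local approximation of the same idea, which is not what we ran in production, a TCP proxy such as Toxiproxy can inject a fixed delay in front of Redis. The sketch below uses the toxiproxy-java client and assumes a Toxiproxy server on localhost:8474 and a Redis host reachable as redis:6379; all names and ports are illustrative.

import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;

public class RedisLatencyExperiment {
    public static void main(String[] args) throws Exception {
        // Assumes a Toxiproxy server is already running on its default API port.
        ToxiproxyClient client = new ToxiproxyClient("localhost", 8474);

        // Route Redis traffic through the proxy (point the app at localhost:26379).
        Proxy redisProxy = client.createProxy("redis", "0.0.0.0:26379", "redis:6379");

        // Add 1000ms of latency to traffic flowing back toward the application,
        // matching the 1s readiness check timeout used in the experiment.
        redisProxy.toxics().latency("redis-latency", ToxicDirection.DOWNSTREAM, 1000);

        // With the latency in place, watch the readiness checks: if the health
        // endpoint depends on Redis, they should start failing the same way.
    }
}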

Documenting action items

Once all the sources of failure have been found and confirmed, it's important to create tickets and document all action items to avoid similar problems in the future. In our case, we first worked on configuring the health endpoint provided by Spring Boot so that it only performed checks on relevant dependencies. Then we documented the findings and proactively reached out to every team and service using Spring Boot to help them mitigate similar issues. Additionally, we started an initiative to document the behavior of the various Kubernetes health checks and share this knowledge across the organization.
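For reference, Spring Boot lets you switch individual contributors off with properties such as management.health.redis.enabled=false, and newer versions expose dedicated readiness and liveness health groups intended for Kubernetes probes. As a hedged illustration of the broader principle, rather than our exact configuration, a probe-facing indicator can be written so that it reports only in-process state and never calls out to a backend; the names below are hypothetical.

import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical, illustrative indicator: readiness depends only on in-process state
// (e.g., caches warmed, configuration loaded), never on a network call to a backend.
@Component("appReadiness")
public class AppReadinessIndicator implements HealthIndicator {

    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Called by application startup logic once the service can take traffic.
    public void markReady() {
        ready.set(true);
    }

    @Override
    public Health health() {
        return ready.get()
                ? Health.up().build()
                : Health.down().withDetail("reason", "still warming up").build();
    }
}

With an indicator like this behind the readiness probe, a slow backend can still degrade real requests, but it no longer knocks Pods out of load balancing through the health check path.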

Common health check pitfalls

Based on our findings during this incident and further learnings, we wanted to share what we think are common pitfalls with Kubernetes health checks and actions we recommend to avoid them.

  • A lack of understanding of what the different Kubernetes probes do
  • Use of third-party health check endpoints with no insight into what actions they perform
  • Disabled observability around health checks, including logs, traces, and metrics

Lessons learned about Kubernetes probes

Throughout this project, our team identified gaps in our knowledge and procedures, and we determined measures to ensure our health checks are more effective and efficient. Here are our recommended steps to avoid similar issues:

1. Understand the different applications of the various Kubernetes probes, and ensure the entire department is aware of these use cases.

2. Verify the behavior and options of any third-party health check endpoints, and consider disabling features on third-party tools that you do not need.

3. Treat health check endpoints as Tier 0 by instrumenting them with various observability methods and ensuring they are not ignored by the observability tooling. If health checks are producing too much data, consider sampling them or reducing the volume of data they share (see the sampler sketch after this list).

4. Be wary of having health checks depend on a backend dependency: an outage in that dependency can become your own outage as Kubernetes pulls Pods out of load balancing when readiness checks fail or restarts containers when liveness checks fail.
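On the observability point in item 3, one way to keep health check telemetry without drowning in it is head sampling keyed on the request path. The sketch below uses the OpenTelemetry Java SDK; the http.target attribute name and the /actuator/health path prefix are assumptions that depend on your HTTP instrumentation and framework, so treat it as a starting point rather than a drop-in policy.

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;

import java.util.List;

// Keeps health check spans instead of excluding them entirely, but samples them
// at a much lower rate than regular traffic so they do not dominate trace volume.
public class HealthCheckAwareSampler implements Sampler {

    // The attribute name depends on your HTTP instrumentation's semantic conventions.
    private static final AttributeKey<String> HTTP_TARGET = AttributeKey.stringKey("http.target");

    private final Sampler regularTraffic = Sampler.parentBased(Sampler.alwaysOn());
    private final Sampler healthChecks = Sampler.traceIdRatioBased(0.01); // keep ~1%

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
                                       SpanKind spanKind, Attributes attributes,
                                       List<LinkData> parentLinks) {
        String target = attributes.get(HTTP_TARGET);
        if (target != null && target.startsWith("/actuator/health")) {
            return healthChecks.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);
        }
        return regularTraffic.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);
    }

    @Override
    public String getDescription() {
        return "HealthCheckAwareSampler";
    }
}

A sampler like this is registered when building the tracer provider, for example via SdkTracerProvider.builder().setSampler(new HealthCheckAwareSampler()).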