DoorDash’s May 12th Outage

May 13, 2022

Ryan Sokol

Ryan Sokol serves as the Vice President of Engineering at DoorDash, where he leads all functions for the engineering team. Prior to joining DoorDash, Ryan led and scaled Uber Eats from its inception, overseeing a team of 250 engineers and serving on the Uber Eats executive leadership team. While at Uber, Ryan also led Uber’s Marketplace Platform team where he oversaw Core Dispatch and Uber’s primary application gateway. Before Uber, Ryan was Head of Engineering at Voxer and held various roles at Genentech, IBM and smaller technology consultancies including his own. Ryan sits on the executive team at DoorDash. He holds a B.A. in Economics from the University of California, Los Angeles, and currently resides in Orinda, California with his wife and two children.

Reviewing the incident timeline

At 9:40 am, our storage team began a routine operation to reduce the capacity of our delivery service database. The delivery service database is a critical dependency of our order and delivery flows. Several months earlier, we had completed the migration of delivery data from a legacy database to a fully distributed database and now wanted to reduce the cluster capacity to be more efficient. When downsizing a database cluster, data is redistributed between nodes, and query latency is expected to increase marginally, but not enough to cause an impact to our service. Downsizing in small increments is an operation that we had performed numerous times before without issues, so we didn’t anticipate any problems.

At 10:29 am, our logistics team and some of their calling services received alerts for elevated p99 latency. The delivery service’s latency SLO wasn’t impacted at that time, but at 10:38 am the storage team reduced the rate of data replication for the downsizing operation to bring latency down, which alleviated concerns about continuing the operation. Related latency alerts fired again at 11:30 am and 2:45 pm, but because we knew the database operation was ongoing and the SLO was not impacted, we took no action.

At 3:57 pm, our logistics team and their dependent teams received another batch of latency alerts, and at 4:04 pm they were paged to investigate errors. The teams assembled on an incident call at 4:07 pm. Dashers were experiencing significant errors when attempting essential functions such as accepting orders and confirming drop-offs. This impact to Dasher flows was correlated with the delivery service database latency and the resulting service errors we had been alerted to. At 4:16 pm, the storage team paused the cluster downsizing operation in an attempt to mitigate the issue.

When the team paused the downsizing operation, the incident unexpectedly got much worse. At 4:16 pm, our synthetic monitoring system alerted us to failures on www.doordash.com. This alert was a surprise, as the website does not have an obvious dependency on our delivery service database. We then began receiving reports of widespread impact to our consumers’ experience. Engineers began reporting issues across many different flows and services, including degradation in our Drive product and in our ability to fulfill orders. Over the next 45 minutes, we continued investigating the errors and disruptions, but with so many issues happening at the same time, we failed to identify a clear signal pointing to the root problem. Without this signal, we attempted to mitigate by restarting various logistics services, but this did not help us identify a root cause or resolve the incident.

Unknown to members of the team handling the incident, at 4:12 pm our Traffic team had been alerted that we were hitting our circuit breaker limits in Envoy. The team had been running the Envoy traffic infrastructure for over a year, and this was the first time they had received this alert. Given our lack of experience with this error, the team did not fully understand the significance of hitting this limit. They investigated and increased several Envoy configuration limits, but this alone did not mitigate the system-wide failure.

At 5:30 pm, we decided to turn off customer traffic at our edge and prevent consumers from using the app to place orders, believing this would allow our systems to recover and clear request queues. We gradually ramped traffic back up over the next 30 minutes. From 6:00 pm to 6:22 pm there was a partial recovery, with 80% of our normal volume of orders from the consumer app being processed.

From then on, we continued seeing an impact to our Dasher customer flows as well as intermittent wider instability of our services. At that point, we understood that Envoy circuit breakers had opened and were likely causing the impact beyond the Dasher flows, so we narrowed focus back to removing the delivery service database latency that we originally detected. We performed various mitigation actions related to the delivery service and its database infrastructure. This included scaling out the number of service instances and database proxy instances to accommodate the increased latency, and restarting some database instances to undo configuration changes made during the incident. This, together with some of the previous mitigation efforts on Envoy, helped us eventually see clear signs of recovery.

At 7:30 pm, our services were finally healthy and stable.

Root cause analysis

Since the incident, engineering teams have spent many hours investigating the root cause in detail. For the purpose of this post, the root cause can be summarized as follows:

Our planned database maintenance increased query latency more than we had expected based on prior similar maintenance operations.
Increased database latency caused increased response latency from the delivery service.
Increased delivery service latency had a wide-ranging impact on its dependent services, including timeouts resulting in error responses.
The increased request latency, along with increased traffic due to retries, caused an increase in connection utilization at multiple points within our Envoy infrastructure. We reached limits for both active connections and requests which caused a large proportion of requests passing through Envoy to be rejected (Envoy returned 503 to callers). Because multiple services shared the same Envoy cluster, this broadly impacted customer flows.
In an attempt to mitigate the increased delivery database latency, we modified database timeouts and restarted database instances, which caused delivery service errors and tail latency to worsen. This mitigation attempt, together with a surge in customer traffic, caused Envoy to reach its limits again and further degraded the customer experience.
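The link between latency, retries, and connection utilization in the chain above follows from Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The sketch below is a rough illustration only. The request rate, latencies, retry factor, and the limit of 1024 are hypothetical values, not measurements from the incident.

```python
def in_flight_requests(rate_per_s: float, latency_s: float,
                       retries_per_request: float = 0.0) -> float:
    """Little's law: average requests in flight = arrival rate x average latency.

    Retries are modeled as extra arrivals on top of the original request rate.
    """
    effective_rate = rate_per_s * (1.0 + retries_per_request)
    return effective_rate * latency_s


# All numbers below are hypothetical and chosen only for illustration.
CONCURRENCY_LIMIT = 1024   # assumed cap on concurrent requests at the proxy
RATE_PER_S = 2000.0        # assumed steady request rate

healthy = in_flight_requests(RATE_PER_S, latency_s=0.05)
degraded = in_flight_requests(RATE_PER_S, latency_s=0.5, retries_per_request=1.0)

for label, value in (("healthy", healthy), ("degraded", degraded)):
    status = "over limit" if value > CONCURRENCY_LIMIT else "within limit"
    print(f"{label}: ~{value:.0f} requests in flight ({status})")
```

Under these assumed numbers, a tenfold latency increase combined with one retry per request multiplies the in-flight count by twenty, which is how a latency regression in one dependency can exhaust a fixed concurrency budget even though the original request rate has not changed.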

Although this failure started with increased latency from a database operation, hitting the Envoy limits resulted in this incident going from a relatively localized issue to a widespread failure of our systems. Its widespread nature caused a lot of noise in our signals and a more chaotic incident response that extended the incident dramatically.

Learnings and improvements

Here are some of the key issues and follow-ups.

Database infrastructure

Since the incident, we have audited our configuration and usage of this database cluster and now understand that a combination of factors (schema design, suboptimal configuration, and usage patterns) caused the additional latency during the resize process. Until those are corrected on this cluster and others, we won’t be performing similar operations. When we resume these operations, we will ensure that we have a better understanding of the latency that our applications can tolerate.

When we migrated delivery data from a legacy database to a new distributed database, we introduced a database proxy to perform the migration and allow for a rollback path. The presence of this proxy added another layer to debug and increased mitigation complexity, so we have expedited its safe removal.

Traffic infrastructure

We had been running Envoy for about 12 months in production. We left the circuit-breaking configuration at the defaults, without sufficient consideration, and we lacked a clear understanding of the impact that hitting these limits would have on our customers. Though we were alerted quickly and had detailed dashboards, we also identified some areas of improvement in Envoy observability. We are following up with an audit of all critical Envoy configuration settings, improving our understanding, enhancing our visibility with more metrics and alerts, and enabling distributed tracing. A key learning for us is that circuit breakers in Envoy are essentially rate limits (caps on active connections and requests) and not traditional circuit breakers that trip in response to observed failures.
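The difference between a limit and a traditional circuit breaker can be sketched as follows. This is a minimal illustration and not Envoy's implementation: the class names, thresholds, and cooldown are hypothetical, and the traditional breaker is simplified (it has no half-open probing state).

```python
class ConcurrencyLimit:
    """Limit-style breaker: rejects only while too many requests are in flight."""

    def __init__(self, max_requests: int) -> None:
        self.max_requests = max_requests
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_requests:
            return False  # over the cap; a proxy would answer with an error such as 503
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


class FailureCircuitBreaker:
    """Traditional breaker: opens after consecutive failures, then cools down."""

    def __init__(self, failure_threshold: int, cooldown_s: float) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # time at which the breaker opened, if it is open

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # cooldown elapsed: close and start counting afresh
            self.failures = 0
            return True
        return False

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


# The limit rejects the third concurrent request, then admits again after a release.
limit = ConcurrencyLimit(max_requests=2)
admitted = [limit.try_acquire() for _ in range(3)]
limit.release()
admitted_after_release = limit.try_acquire()

# The traditional breaker opens on the second consecutive failure and
# blocks all calls until the cooldown has passed.
breaker = FailureCircuitBreaker(failure_threshold=2, cooldown_s=10.0)
breaker.record(success=False, now=0.0)
breaker.record(success=False, now=1.0)
blocked = breaker.allow(now=2.0)
recovered = breaker.allow(now=11.0)
```

The practical consequence is that the limit-style breaker never considers whether the upstream is failing. It starts rejecting as soon as slow responses hold enough requests open, and it admits traffic again the moment capacity frees up.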

Multiple services share the same east-west traffic infrastructure. In this configuration, the circuit breaker is shared by all traffic passing through it. This means that a single service under heavy connection or request load can cause the circuit breaker to open for other services as well. As part of our immediate incident follow-up, we have effectively disabled shared circuit breaking. Next, we will look at traffic routers per domain for better failure isolation.
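To make the isolation argument concrete, the sketch below compares one shared limit with separate per-service limits. It is a simplified model under stated assumptions: the service names, demand figures, and limits are hypothetical, and rejections under a shared limit are assumed to fall on all services in proportion to their demand.

```python
def shared_admission(demand: dict[str, int], limit: int) -> dict[str, float]:
    """Fraction of each service's requests admitted when all share one limit."""
    total = sum(demand.values())
    fraction = 1.0 if total <= limit else limit / total
    return {service: fraction for service in demand}


def isolated_admission(demand: dict[str, int],
                       limits: dict[str, int]) -> dict[str, float]:
    """Fraction admitted when every service has its own limit."""
    return {
        service: 1.0 if needed <= limits[service] else limits[service] / needed
        for service, needed in demand.items()
    }


# Hypothetical concurrent demand: one slow service holds many requests open.
demand = {"delivery": 2000, "consumer_web": 200, "drive": 100}

shared = shared_admission(demand, limit=1024)
isolated = isolated_admission(demand, limits={name: 1024 for name in demand})

for name in demand:
    print(f"{name}: shared={shared[name]:.0%} isolated={isolated[name]:.0%}")
```

In this model the overloaded service loses roughly half of its requests either way, but only the shared limit pushes that loss onto the services that were otherwise healthy.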

Conclusion

We want to again sincerely apologize to our community of Customers, Dashers, and Merchants who count on us to access opportunity and convenience. As an engineering team, we would also like to give credit to and thank the teams across DoorDash who worked to handle support cases, issue refunds, and help to make things right for our customers.

We will learn from this incident and apply those learnings to improve our infrastructure and reliability in 2022 and beyond.
