Between 16:30 PDT and 18:40 PDT on June 19th 2021, DoorDash experienced a system-wide failure for approximately two hours that saddled merchants with undelivered meals, rendered Dashers unable to accept new deliveries or check in for new shifts, and left consumers unable to order food or receive their placed orders in a timely fashion via our platform. The cause was a cascading failure of multiple components in DoorDash’s platform, which put extreme load on our internal payments infrastructure, eventually causing it to fail. The current analysis shows no leading indication prior to the incident triggering at 16:30 PDT, and mitigation took significantly longer than the standard we aim to hold ourselves to as an engineering team.
We fully understand our responsibility as a commerce engine whose primary aim is to grow and empower local economies. As an engineering team, we firmly believe that reliability is our number one feature, and in this instance we failed, plain and simple. Customers will be refunded for orders that were cancelled as a result of this outage. Merchants will be paid for orders that were cancelled during this period, as well. Dashers will be compensated for orders they were unable to complete, and any ratings below five stars received during the outage will be removed from their Dasher history. For those interested in the technical causes and the mitigation and prevention initiatives that we’ve undertaken since the incident, please feel free to read on.
The impact of the outage
From 16:30 PDT to 18:36 PDT, most Dashers were unable to accept new deliveries or check in for their shifts, which significantly degraded DoorDash’s delivery fulfillment capabilities. As a result, at 17:19 PDT DoorDash halted new orders from our customers, and at 17:22 PDT did the same for DoorDash’s Drive partners. Upon finding and fixing the cause of the incident, DoorDash re-enabled Drive partners’ order-placing capability at 18:22 PDT and re-enabled full ordering capability for DoorDash customers between 18:32 PDT and 18:39 PDT.
All times PDT on 06/19/2021
- 16:30 Latency for some internal payments APIs begins to rise.
- 16:30 Memory and CPU usage for internal payments-related deployments begins to rise.
- 16:30 Dasher-related services begin to exhibit increased latency and the Dasher app begins to present errors to Dashers that prevent them from accepting new orders and checking in for new shifts.
- 16:35 System-wide alerts begin to trigger, with engineers being paged.
- 16:40 Payments systems are scaled out by 50% in an attempt to alleviate CPU and memory pressure.
- 16:59 Payments systems were restarted but no sustained recovery was realized.
- 17:01 Cascading failures increase call volumes on payments by five times the normal levels.
- 17:19 DoorDash halted all new consumer orders.
- 17:22 DoorDash Drive was disabled for merchant partners.
- 18:12 The Engineering team was able to pinpoint the source of increased traffic within Dasher systems which was in turn putting pressure on our payments services.
- 18:12 All traffic to Dasher systems was stopped at the network layer to allow systems to recover.
- 18:20 All traffic was re-enabled to Dasher systems at the network layer but problems re-emerged.
- 18:22 DoorDash Drive was re-enabled.
- 18:25 Config was deployed to Dasher systems to prevent downstream payment calls which alleviated the cascading failures.
- 18:26 All traffic to Dasher systems was stopped at the network layer for a second time to allow systems to recover.
- 18:28 All traffic was re-enabled to Dasher systems at the network layer.
- 18:29 Dasher and payment system health sustained.
- 18:32 Consumer ordering is re-enabled for 25%.
- 18:37 Consumer ordering is re-enabled for 50%.
- 18:38 Consumer ordering is re-enabled for 100%.
Root cause analysis
Starting at 16:30 PDT on 6/19/2021, the payments infrastructure began to exhibit high latency when fetching data required by the Dasher App and its supporting systems. While teams were diagnosing this high latency and the resulting failures, retry attempts from the Dasher systems compounded the issue as additional traffic caused by these retries strained the payment infrastructure that was already unhealthy. This led to Dashers not being able to fulfill orders, causing poor experiences for all consumers, Dashers and merchants. Though we have defined and documented best practices for the interaction between components that would help us mitigate these scenarios, the components involved in this incident (payments and Dasher) did not have these patterns implemented.
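To illustrate how retries can compound load on a dependency that is already failing, here is a minimal sketch. The function name and the numbers are hypothetical and do not reflect DoorDash’s actual retry configuration; the sketch also assumes each attempt fails independently with the same probability, which is a simplification.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average number of attempts made per logical request.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expected count is the sum of failure_rate**k for k = 0..max_retries.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(failure_rate ** k for k in range(max_retries + 1))


# Healthy dependency: one attempt per request.
healthy = expected_attempts(0.0, 4)
# Dependency failing every call: every request uses all of its retries.
failing = expected_attempts(1.0, 4)
```

Under these assumptions, the multiplier grows exactly when the dependency is least able to absorb extra traffic, which is why unbounded or aggressive retries tend to prolong an outage.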
A root cause of the issue is the lack of defensive programming techniques, such as load shedding and circuit breaking, designed to protect distributed systems like ours from catastrophic failures like the one we experienced. In this case, the server (payments infrastructure) had not implemented load shedding, which would have prevented it from collapsing under the elevated request volume that resulted from higher latencies. The client (Dasher App and systems) had not implemented circuit breaking, which should have triggered to temporarily bypass its calls to an unhealthy downstream dependency.
The DoorDash engineering team has spent every hour since the conclusion of the incident implementing corrective actions and defensive coding practices while actively investigating the original trigger within the payments infrastructure.
DoorDash’s corrective actions
The first change introduced a load shedding mechanism to the payments infrastructure, where the incident was triggered. This mechanism, deployed successfully to production on 6/20/2021 at 07:36 PDT, gives the payments infrastructure the ability to gracefully shed inbound request volume that exceeds its operating capacity.
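As a rough illustration of the load shedding idea, the sketch below rejects requests once a fixed number are already in flight. It is not DoorDash’s implementation: the class and function names, the in-flight cap, and the choice of a concurrency limit (rather than, say, a queue-time or CPU-based threshold) are all assumptions made for the example.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is shed instead of being processed."""


class LoadShedder:
    """Admits requests until a fixed in-flight cap is reached (hypothetical)."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot; return False if the server is at capacity."""
        with self._lock:
            if self._in_flight >= self._max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot previously reserved with try_acquire()."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_request(shedder, handler, request):
    """Run handler(request), or fail fast when over capacity."""
    if not shedder.try_acquire():
        # Rejecting immediately is cheap; queuing would add latency and memory.
        raise Overloaded("request shed: server at capacity")
    try:
        return handler(request)
    finally:
        shedder.release()
```

In a real service, the `Overloaded` error would typically be translated into a fast rejection response so that callers learn quickly that the server is saturated instead of waiting on a slow request.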
The second change introduced circuit breaking mechanisms to the Dasher infrastructure and operating playbook. This mechanism enables Dasher infrastructure to bypass its dependency on the payments infrastructure in the event of service instability. With these changes we are confident that our Dasher infrastructure can withstand similar downstream instability with little to no system-wide impact.
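The circuit breaking pattern can be sketched as follows. This is a simplified, single-threaded illustration under assumed parameters, not the mechanism DoorDash deployed: `CircuitBreaker`, `fetch_with_fallback`, the failure threshold, and the reset timeout are all hypothetical names and values.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Stops calling a dependency after repeated consecutive failures."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: skip the dependency entirely.
                raise CircuitOpenError("dependency bypassed")
            # Timeout elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            tripped = self._failures >= self.failure_threshold
            if self._opened_at is not None or tripped:
                # Open (or re-open after a failed trial) and restart the timer.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, fallback):
    """Return fetch() if the dependency is usable, else the fallback value."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback
```

While the breaker is open, the caller returns its fallback without sending any traffic downstream, which gives the unhealthy dependency room to recover. A production version would also need thread safety and metrics, which are omitted here.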
The third action is a comprehensive audit of payment infrastructure’s interfaces and APIs to ensure that sufficient documentation exists and that the upstream call graph is well understood and fully diagnosable.
We believe these immediate changes will help prevent similar events from occurring and pledge to use this moment to complete a comprehensive audit of our systems to ensure that best practices and operational knowledge are well distributed and implemented. In time, we hope to regain the trust of those whose trust we’ve lost and, as always, will aim to be 1% better every day.