Avoiding cache stampede at DoorDash

Avoiding cache stampede at DoorDash

By Zohaib Sibte Hassan, Software Engineer at DoorDash

One of the core mottoes on the engineering team at DoorDash is:

We are only as good as our next delivery!

With high traffic loads and orders flying in, every engineering decision has a critical impact on what our customers, merchants, and dashers will experience. We have to pay careful attention to the details and performance of the system to ensure all three sides of the business are operating flawlessly.

The game of caches

Caching is one common, and well-practiced way to reduce load on database and improve latency for any particular service. This is usually effective for read intensive systems e.g. fetching a menu for a restaurant. In-memory data stores such as Redis, and Memcached are commonly used tools for such a task; though they introduce additional serialize/deserialize, and network overhead. This overhead can be reduced with the help of in-process thread-safe cache (usually an LRU hash map). This in-process cache serves as an L1 cache, while Redis serves as an L2 cache, and the DB serves as master.

A typical caching setup

Reads are optimized when the L1 cache has been populated with required entries. The problem arises when there is a cache miss, which can be due to cache expiration, deployment roll-out, or a server restart. In this case, new requests must go to L2 or DB to read the value again. Doing so during peak hours, and high load can result in multiple parallel duplicate reads. This behavior is usually known as Cache stampeding or Cache Miss Storm, which causes a spike in both network traffic and latency.

Existing work

There are some existing approaches that rely on locking, or an external system refreshing the cache, or probabilistic early expiration. At DoorDash we wanted to solve the cache stampede that could be caused by a L1 cache miss, resulting in parallel duplicate reads to L2 or DB. Using built-in constructs from Kotlin coroutines we solved the problem without inventing another complex library. Towards the end of this post we will share some numbers and results we achieved. We heavily rely on GRPC, Netty and Kotlin coroutines to keep the internal microservices performant. This article assumes that readers have some basic understanding of coroutines, or the equivalent in their technology stack (Go calls them go-routines, C# calls them tasks, Python 3 also calls them coroutines etc.). While the solution discussed here is more specific to Kotlin, the general idea holds true everywhere and works really well for any event loop based async system. For example, the same effect can be achieved using Node.js promise with a simple dictionary.

The debouncer approach

To solve the problem, we took inspiration from something front-end engineers use frequently. Debouncing is a common practice in the JS world to prevent duplicate events from firing and causing noise in the system. A well-known way to create a debouncing function looks something like this (using lodash or underscore.js):

let debouncedFetchData = _.debounce(fetchData, 100);
// use debouncedFetchData …

The above line creates a debounced function that delays invoking fetchDatauntil after 100 milliseconds have elapsed since the last time the debounced function was invoked.

We wanted something similar to debounced function, but instead of waiting for 100 milliseconds, the winning coroutine among racing coroutines should quickly return a Deferred, and the remaining coroutines should wait on the returned Deferred rather than spinning up their own read. Different technologies have different names for Deferred, JavaScript has Promise, C# has Task and so on. This debouncing can be grouped on an operation ID. For example, if we are trying to fetch a menu from Redis or Database with ID 701064, we can use restaurant-fetch-701064 as a key to uniquely identify the operation. This operation may internally use exponential back-offs, call another service, read L2, fall back to database, or it might end up reading multiple tables to produce one value; but it should uniquely identify an operation that we want to deduplicate.

Our solution relies on a coroutine-safe (just like thread-safe) scoreboard that tracks pending Deferred using an ID. After a coroutine has been scheduled to fulfill a Deferred against an ID, the subsequent coroutines with the same ID use that pending Deferred to wait for results. Once Deferred completes, it is removed from scoreboard. The reader code looks something like this:

The method getRestaurantMenus, when simultaneously invoked by many coroutines, will result in one of the coroutines winning the race condition and successfully entering the body to execute fetchMenuFromRemoteCacheOrDatabase. This debounce method immediately returns Deferred<List<Menu>> to all coroutines while the fetchMenuFromCacheOrDatabase executes. All of the coroutines then proceed to await in order to read the results.

How does the debouncer actually work? In the code above, CoroutineDebouncer(ConcurrentHashMap()) relies on computeIfAbsent to create or read a Deferred atomically (be aware of map implementation you use, make sure the implementation applies the function only once). The precise implementation is dead simple and looks like this:

The computeIfAbsent allows us to launch an async callback that is scheduled for execution, and once completed we do remove from the pending board. For the ConcurrentMap parameter required in the constructor, we used a ConcurrentHashMap for simplicity, but this can be replaced with a NonBlockingHashMap for a higher performing lock-free map, or with your own custom implementation that guarantees atomic operations.

Comparing apples to apples

After applying the changes to our microservice, we benchmarked our new version against the old version. Our machine was MacBook Pro 2.2 GHz i7 with 16GB of RAM and JVM flags -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:+ParallelRefProcEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60.

The GRPC endpoint being tested is performing a full read through operation ranging from an L1 cache (5 seconds TTL), an L2 cache (10 seconds TTL) and finally falling back to our Postgres database. We used ghz to benchmark service for 60 seconds with 2000 concurrent connections and no rate limit. We explicitly chose short expiration times to simulate multiple stampedes, and observed an overall effect during the 60 second window. Here are the results:

Cold boot

Without debouncer:

Summary:
  Count: 887495
  Total: 60059.11 ms
  Slowest: 6908.89 ms
  Fastest: 0.55 ms
  Average: 135.10 ms
  Requests/sec: 14777.03
Response time histogram:
  0.546 [1]  |
  691.381 [870160] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1382.216 [7434] |
  2073.050 [4404] |
  2763.885 [2186] |
  3454.720 [209] |
  4145.555 [312] |
  4836.390 [559] |
  5527.225 [1056] |
  6218.060 [505] |
  6908.895 [669] |
Latency distribution:
  10% in 27.40 ms
  25% in 57.70 ms
  50% in 84.08 ms
  75% in 112.40 ms
  90% in 170.33 ms
  95% in 254.05 ms
  99% in 1549.95 ms
Status code distribution:
  [OK] 887495 responses

With debouncer:

Summary:
  Count: 1156274
  Total: 60041.89 ms
  Slowest: 1731.10 ms
  Fastest: 32.23 ms
  Average: 103.68 ms
  Requests/sec: 19257.79
Response time histogram:
  32.227 [1]  |
  202.115 [972011] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  372.003 [23286] |∎
  541.890 [2702] |
  711.778 [0]  |
  881.665 [0]  |
  1051.553 [0]  |
  1221.440 [0]  |
  1391.328 [43]  |
  1561.216 [942] |
  1731.103 [1015] |
Latency distribution:
  10% in 66.05 ms
  25% in 77.36 ms
  50% in 92.63 ms
  75% in 113.26 ms
  90% in 147.19 ms
  95% in 178.08 ms
  99% in 264.65 ms
Status code distribution:
  [OK] 1156274 responses

Warmed up

Without debouncer:

Summary:
  Count: 1053108
  Total: 60769.34 ms
  Slowest: 8058.86 ms
  Fastest: 0.38 ms
  Average: 114.43 ms
  Requests/sec: 17329.60
Response time histogram:
  0.382 [1]  |
  806.230 [982042] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1612.078 [9147] |
  2417.926 [5000] |
  3223.774 [1944] |
  4029.621 [479] |
  4835.469 [953] |
  5641.317 [371] |
  6447.165 [1]  |
  7253.012 [3]  |
  8058.860 [59]  |
Latency distribution:
  10% in 23.74 ms
  25% in 48.83 ms
  50% in 78.63 ms
  75% in 98.37 ms
  90% in 122.91 ms
  95% in 158.75 ms
  99% in 1474.71 ms
Status code distribution:
  [OK] 1053108 responses

With debouncer:

Summary:
  Count: 1321340
  Total: 60064.00 ms
  Slowest: 578.69 ms
  Fastest: 36.04 ms
  Average: 90.77 ms
  Requests/sec: 21998.87
Response time histogram:
  36.045 [1]  |
  90.309 [574748] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  144.573 [401937] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  198.838 [16613] |∎
  253.102 [705]  |
  307.367 [0]  |
  361.631 [0]  |
  415.896 [78]  |
  470.160 [2159] |
  524.425 [2099] |
  578.689 [1660] |
Latency distribution:
  10% in 68.67 ms
  25% in 76.69 ms
  50% in 87.01 ms
  75% in 99.63 ms
  90% in 112.48 ms
  95% in 124.60 ms
  99% in 176.08 ms
Status code distribution:
  [OK] 1321340 responses

Conclusion

Due to the reduced memory and network loads, we observed an average difference of more than ~4K requests per second. More importantly the p99 latency was reduced from ~1550ms to ~267ms (almost 83% reduction) for cold boot case, and from ~1447ms to ~176ms (almost 87% reduction) for warmed up case.

In order to verify and visually see how many extra calls we were saving over time, we instrumented CoroutineDebouncer code above, and added markers to count the number of times computeIfAbsent invoked the callback vs the total number of calls to the debounce method. We ran our benchmark for 120 seconds with 4000 concurrent requests and a mix of repeated random IDs to simulate real load. The results were encouraging:

Number of time computeIfAbsent applies the function vs total number of calls

We’ve also put together a sample JS implementation that you can use to simulate results yourself. It follows the exact principle described above, you can fork and play around with different parameters.

There can be various approaches to evict Deferred, or having some sort of waiting timeout on debouncer, but the core idea remains the same. For systems running in a cluster, this approach relies on preventing a stampede from one instance, thus minimizing a cluster-wide stampede. In a very large cluster (thousands), we might still experience a stampede, which will require a different solution. So far this approach has worked well for us in production, and helps us deliver good latency numbers.

With high traffic loads on our systems, every improvement matters and contributes to a snappy experience for our customers, merchants, and dashers. Craftsmanship, and attention to detail helps us deliver our next order in a timely manner.

Zohaib is the engineering lead on the Platform Engineering team, focused on craftsmanship, performance, intelligent systems, hacking, and system architecture.
Author's Linkedin