As DoorDash's microservices architecture has grown, so too has the volume of interservice traffic. Each team manages their own data and exposes access through gRPC services, an open-source remote procedure call framework used to build scalable APIs. Most business logic is I/O-bound because of calls to downstream services. Caching has long been a go-to strategy to improve performance and reduce costs. However, the lack of a uniform approach to caching has led to complications. Here we explain how we have streamlined caching through a Kotlin library, offering backend developers a fast, safe, and efficient way to introduce new caches.

Boosting performance while supporting business logic

In the world of DoorDash microservices, the focus lies more on implementing business logic than on performance optimization. While optimizing I/O patterns in the code could improve performance, rewriting business logic to do so would be time-consuming and resource-intensive. The problem, then, becomes how to boost performance without overhauling the existing code. 

One time-tested solution is caching — the practice of storing copies of frequently accessed data close to where it is needed to improve speed and performance for subsequent requests. Caching can be added transparently to business logic code simply by wrapping the methods used to retrieve data. 

The most common caches at DoorDash are Caffeine for local caching and Redis Lettuce for distributed caching. Most teams use Caffeine and Redis Lettuce clients directly in their code.

Because caching comes with a common set of pitfalls, teams implementing their own independent approaches kept running into the same issues.

Problems:

  1. Cache staleness: While implementing caching for a method is straightforward, it’s challenging to ensure that the cache remains updated with the original data source. Resolving issues that arise from outdated cache entries can be complex and time-consuming.
  2. Heavy dependency on Redis: Services frequently encountered a high rate of failure whenever Redis was down or experiencing issues.
  3. No runtime control: Introducing a new cache can be risky because of the lack of real-time adjustments. If the cache encounters issues or requires tuning, changes require a new deployment or rollback, which consumes both time and development resources. Additionally, a separate deployment is required to tune cache parameters such as TTL.
  4. Inconsistent key schema: The absence of a standardized approach for cache keys complicates debugging efforts. Specifically, it’s difficult to trace how a key in the Redis cache corresponds to its usage in Kotlin code.
  5. Inadequate metrics and observability: The absence of uniform metrics across teams resulted in a lack of critical data, such as cache hit rates, request counts, and error rates. 

  6. Difficulty in implementing multilayered caching: The previous setup didn’t easily support multiple caching layers for the same method, even though combining a fast local cache with a more resource-intensive Redis cache could serve most requests before resorting to the fallback.

Dream big, start small

While we ultimately created a shared caching library for all of DoorDash, we started with a pilot program to tackle caching problems for just one service — the DashPass backend. We wanted to battle-test and iterate on our solution before adopting it elsewhere.

At the time, DashPass was experiencing scaling challenges and frequent brownouts. DoorDash was growing rapidly and seeing increasing traffic every week. DashPass was one of the heaviest users of our shared Postgres database, on which almost all of DoorDash relied; if that database went down, customers would not be able to place orders.

Simultaneously, we were also rapidly developing new features and use cases for DashPass, so the developer bandwidth for performance tuning was low. 

With all of this critical activity occurring alongside pressure to stabilize the service — even as most engineers were busy managing business-related features — we decided to develop a simple caching library that could be integrated transparently and with minimal disruption.

Single interface to rule them all

With each team using a different caching client, such as Caffeine, Redis Lettuce, or plain HashMaps, there was little consistency in function signatures and APIs. To standardize this, we introduced a simplified interface for application developers to use when setting up new caches, as shown in the following code snippet:

import kotlinx.serialization.KSerializer

interface CacheManager {
    /**
     * Wraps fallback in Cache.
     * key: Instance of CacheKey.
     *      Subclasses of CacheKey define a unique cache with a unique
     *      name, which can be configured via runtime.
     * fallback: Invoked on a cache miss. The return value is then cached and
     *           returned to the caller.
     */
    suspend fun <V> withCache(
        key: CacheKey<V>,
        fallback: suspend () -> V?
    ): Result<V?>
}

/**
 * Each unique cache is tied to a particular implementation of the key.
 *
 * CacheKey controls the cache name and the type of unique ID.
 *
 * The name of the cache is the class name of the implementing class, so
 * all implementations should use a unique class name.
 */
abstract class CacheKey<V>(
    val cacheKeyType: String,
    val id: String,
    val config: CacheKeyConfig<V>
)

/**
 * Cache-specific config.
 */
class CacheKeyConfig<V>(
    /**
     * Kotlin serializer for the return value. This is used to store values in Redis.
     */
    val serializer: KSerializer<V>
)

This allows us to use dependency injection and polymorphism to inject arbitrary logic behind the scenes while maintaining uniform cache calls from business logic.
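
As a rough sketch of that polymorphism (hypothetical, not the library's actual code), even a trivial pass-through implementation satisfies the same interface, so business logic never changes when the implementation behind it does:

// Hypothetical pass-through implementation: always invokes the fallback and
// skips caching entirely. Because call sites depend only on the CacheManager
// interface, this can be swapped for a fully layered implementation via
// dependency injection without touching business logic.
class PassThroughCacheManager : CacheManager {
    override suspend fun <V> withCache(
        key: CacheKey<V>,
        fallback: suspend () -> V?
    ): Result<V?> = runCatching { fallback() }
}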

Layered caches

Adopting a single, simplified interface for cache management also made it easy for teams that had previously used only a single cache to move to a multi-layered caching system. Multiple layers can boost performance because some layers, such as the local cache, are much faster than layers that involve network calls, such as Redis, which in turn is still faster than most service calls.

In a multi-layer cache, a key request progresses through the layers until the key is found or until it reaches the final source of truth (SoT) fallback function. If the value is retrieved from a later layer, it's then stored in earlier layers for faster access on subsequent requests for the same key. This layered retrieval and storage mechanism optimizes performance by reducing the need to reach the SoT.

We implemented three layers behind a common interface as shown in Figure 1:

  1. Request local cache: Lives only for the lifetime of the request; uses a simple HashMap.
  2. Local cache: Visible to all workers within a single Java virtual machine; uses a Caffeine cache for heavy lifting.
  3. Redis cache: Visible to all pods sharing the same Redis cluster; uses Lettuce client.

Figure 1: Multi-layer cache request flow
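
The read path in Figure 1 can be sketched roughly as follows (a simplified illustration under assumed layer and helper names, not the library's actual implementation):

// Hypothetical per-layer interface: each layer knows how to get and set a value.
interface CacheLayer {
    suspend fun <V> get(key: CacheKey<V>): V?
    suspend fun <V> set(key: CacheKey<V>, value: V)
}

// Simplified read-through lookup across layers. On a hit from a later layer,
// earlier (faster) layers are backfilled; on a full miss, the fallback (SoT)
// is invoked and every layer is populated.
class LayeredCacheManager(
    private val layers: List<CacheLayer> // e.g. request-local, Caffeine, Redis
) : CacheManager {
    override suspend fun <V> withCache(
        key: CacheKey<V>,
        fallback: suspend () -> V?
    ): Result<V?> = runCatching {
        for ((index, layer) in layers.withIndex()) {
            val cached = layer.get(key)
            if (cached != null) {
                layers.take(index).forEach { it.set(key, cached) } // backfill faster layers
                return@runCatching cached
            }
        }
        fallback()?.also { value -> layers.forEach { it.set(key, value) } }
    }
}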

Runtime feature flag control

Various use cases may call for different configurations or turning entire caching layers off. To make this much faster and easier, we added runtime control. This allows us to onboard new caching use cases once in code, then follow up via runtime for rollout and tuning.

Each unique cache can be controlled individually via DoorDash’s runtime system. Each cache can be:

  • Turned on or off. This can be handy if a newly introduced cache strategy has a bug. Instead of doing a rollback deployment, we can simply turn the cache off. In off mode, the library invokes fallback, skipping all cache layers entirely.
  • Reconfigured for an individual time to live (TTL). Setting a layer’s TTL to zero will skip it entirely. 
  • Shadowed at a specified percentage. In shadow mode, a percentage of requests to cache will also compare cached value against the SoT.
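
Conceptually, these controls amount to a small per-cache config object that the library consults on every request. The fields below are a hypothetical sketch of those knobs rather than DoorDash's actual runtime schema:

import kotlin.time.Duration
import kotlin.time.Duration.Companion.minutes
import kotlin.time.Duration.Companion.seconds

// Hypothetical runtime configuration for one uniquely named cache.
// Because values are read from the runtime system at request time,
// they can be changed without a deployment.
data class CacheRuntimeConfig(
    val enabled: Boolean = false,            // safe default: new caches start off
    val requestLocalTtl: Duration = 1.seconds,
    val localTtl: Duration = 5.minutes,      // a TTL of zero skips that layer
    val redisTtl: Duration = 10.minutes,
    val shadowPercentage: Double = 0.0       // fraction of reads compared against the SoT
)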

Observability and cache shadowing

To measure cache performance, we collect metrics on how many times a cache is requested and how many times requests result in a hit or miss. Cache hit ratio is the primary performance metric; our library collects hit ratio metrics for each unique cache and layer.

Another important metric is how fresh cache entries are compared to the SoT. Our library provides a shadowing mechanism to measure this. If shadowing is turned on, a percentage of cache reads will also invoke fallback and compare cached and fallback values for equality. Metrics on successful and unsuccessful matches can be graphed and alerted on. We also can measure cache staleness — the latency between cache entry creation and when the SoT was updated. Measuring cache staleness is critical because each use case has a different staleness tolerance.

In addition to metrics, any mismatches also generate error logs, which itemize the path in the object that differs between the cached and original values. This can be handy when debugging stale caches.

Providing observability into cache staleness is key for empirically validating a cache invalidation strategy.
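
A minimal sketch of the shadow comparison, with hypothetical helper and parameter names (the real library also records staleness latency and emits its own metrics):

import kotlin.random.Random

// Hypothetical shadow-mode check: for a sampled fraction of cache reads, also
// call the source of truth and compare it to the cached value. recordMatch
// stands in for whatever metrics/logging sink the service uses.
suspend fun <V> shadowCompare(
    shadowPercentage: Double,            // e.g. 0.01 shadows 1% of reads
    cachedValue: V?,
    fallback: suspend () -> V?,
    recordMatch: (matched: Boolean) -> Unit
) {
    if (Random.nextDouble() >= shadowPercentage) return  // only sample a fraction
    val sourceOfTruth = fallback()                       // extra read of the SoT
    recordMatch(cachedValue == sourceOfTruth)            // graph and alert on mismatches
}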

Example usage

Let’s go over an example and dive deeper into the library API.

Each cache key has four main components:

  1. Unique cache name, which is used as a reference in runtime controls. 
  2. Cache key type, a string representing the type of entity the key refers to, which allows cache keys to be categorized.
  3. ID, a string that refers to some unique entity of cache key type.
  4. Configuration, which includes default TTLs and a Kotlin serializer.

To standardize key schema, we chose the uniform resource name (URN) format:

urn:doordash:<cache key type>:<id>#<cache name>
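
Given the CacheKey fields above, rendering the URN is mechanical; a hypothetical helper might look like this:

// Hypothetical helper that renders a CacheKey as its URN-format Redis key.
// The cache name is taken from the implementing class's simple name.
fun <V> CacheKey<V>.toUrn(): String =
    "urn:doordash:$cacheKeyType:$id#${this::class.simpleName}"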

The library provides a CacheManager instance, which is injected and exposes a `withCache` method that wraps the fallback, a Kotlin suspend function whose result should be cached.

For instance, if we have a repository UserProfileRepository with a method getUserProfile that we want to cache, we could add the following key:

class UserProfileRepositoryGetUserProfileKey(userId: String) : CacheKey<UserProfile>(
    cacheKeyType = "user",
    id = userId,
    config = CacheKeyConfig(serializer = UserProfile.serializer())
)

...

// cacheManager is the injected CacheManager instance
suspend fun getUserProfile(userId: String): UserProfile =
    cacheManager.withCache(UserProfileRepositoryGetUserProfileKey(userId)) {
        ... <Fetch user profile> ...
    }.getOrThrow()

A key for the user with ID “123” would be represented as a URN as follows: 

urn:doordash:user:123#UserProfileRepositoryGetUserProfileKey

Note that any other CacheKey that uses “user” as the cache key type will share the same prefix as UserProfileRepositoryGetUserProfileKey. 

Standardizing how keys are represented is great for debugging and observability, and opens up unique opportunities for pattern-matching keys.
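
For example, because every cache keyed on the same entity shares a prefix, a single pattern can match all cached entries for a given user, which is handy for debugging or bulk cleanup (a hypothetical illustration):

// Hypothetical: matches every cache entry for user 123 across all caches that
// use the "user" cache key type, e.g. when scanning Redis during debugging.
val userCachePattern = "urn:doordash:user:123#*"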

Use case guidance

Once we created and battle-tested the library in DashPass, the next step was to get it to developers and help them integrate it into their work as seamlessly as possible. To do so, we gave high-level guidance on when and how to use caching — and, just as importantly, when not to use it.

When to use caching

We can break up caching use cases by eventual consistency constraints. 

Category 1: Can tolerate stale cache

In certain use cases, it’s acceptable to have a few minutes of delay for updates to take effect. In these situations, it’s safe to use all three caching layers: request local cache, local cache, and Redis layer. You can set the TTL for each layer to expire in several minutes. The longest TTL setting across all layers will determine the maximum time for the cache to become consistent with the data source. 

Monitoring cache hit rates is crucial for performance optimization; adjusting the TTL settings can help improve this metric. 

In this scenario, there's no need to implement shadowing to monitor cache accuracy.

Category 2: Cannot tolerate stale cache

When data is subject to frequent changes, stale information could adversely affect business metrics or user experience. It becomes crucial to limit the maximum tolerable staleness to just a few seconds or even milliseconds. 

Local caching should generally be avoided in such a scenario because it can't be invalidated easily. However, request-level caching might still be suitable for temporary storage. 

While it is possible to set a longer TTL for the Redis layer, it's essential to invalidate the cache as soon as the underlying data changes. Cache invalidation can be implemented in various ways, such as deleting the relevant Redis keys upon data updates or using a tagging approach to remove caches when pattern-matching is difficult. 

There are two main options for invalidation triggers. The preferred method is to use Change Data Capture events emitted when database tables are updated, although this approach may involve some latency. Alternatively, the cache can be invalidated directly within the application code when data changes, as sketched below. This is faster but potentially more complex, because every code path that modifies the data must also remember to invalidate the cache. 
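
As a rough sketch of the second option (helper names hypothetical, not the library's API), the write path deletes the corresponding Redis entry using the standardized URN key from the earlier example:

// Hypothetical in-application invalidation: once the source of truth is updated,
// delete the matching Redis entry so the next read repopulates the cache.
// writeToSourceOfTruth and redisDelete stand in for the service's own calls.
suspend fun updateUserProfile(
    userId: String,
    newProfile: UserProfile,
    writeToSourceOfTruth: suspend (UserProfile) -> Unit,
    redisDelete: suspend (key: String) -> Unit
) {
    writeToSourceOfTruth(newProfile)  // update the database first
    redisDelete("urn:doordash:user:$userId#UserProfileRepositoryGetUserProfileKey")
}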

It is crucial to enable cache shadowing to monitor staleness because this visibility is vital for verifying effectiveness of the cache invalidation strategy.

When not to use caching

Write or mutation flows

It’s a good idea to reuse code as much as possible so that your write endpoint may reuse the same cached function as your read endpoints. But this presents a potential staleness issue when you write to the database and then read the value back. Reading back a stale value may break business logic. Instead, it's safe to turn off caching altogether for these flows while reusing the same cached function outside of the CacheContext.

As a source of truth

Do not use the cache as a database or rely on it as a source of truth. Always be mindful of expiring caching layers and have a fallback that queries the correct source of truth.

Conclusion

DoorDash's microservices faced significant challenges as a result of fragmented caching practices. By centralizing these practices in a single comprehensive library, we dramatically improved scalability and bolstered safety across our services. With the introduction of a standardized interface, consistent metrics, a unified key scheme, and adaptable runtime configuration, we have made the process of introducing new caches fast and safe. Moreover, by offering clear guidance on when and how to employ caching, we've staved off potential misuse and inefficiencies. This strategic overhaul has positioned us to harness caching's full benefits while sidestepping its common pitfalls.
