Failures in a large, complex microservice architecture are inevitable, so built-in fault tolerance — retries, replication, and fallbacks — is a critical part of preventing system-wide outages and a negative user experience. The industry today is replacing legacy monolithic architectures with microservices to capitalize on the promise of better scalability and developer productivity. As with anything in distributed systems, this new architecture comes with tradeoffs — most notably complexity. Interservice communication is done through remote procedure calls, or RPCs, which add a whole new set of issues to consider. But designing for fault tolerance from the start can ensure the system continues to operate properly in the event of failures.

RPCs can fail for many reasons, so fallbacks should be added for resiliency. Other strategies, such as retries or distributing traffic to healthy replicas of the target service, can also help the system recover from a failed RPC. When the outage involves the communication link itself, however, those strategies can’t contribute to system recovery. Having a fallback path ensures that the RPC succeeds regardless of failures in the primary path.

Of course, DoorDash’s microservice architecture is not immune to failures. We work particularly hard to improve fault tolerance in our translation systems because users expect to be able to use our products in their language of choice. Any outage affecting those systems could frustrate users and block customers from placing orders, making it critical that our system operates smoothly even in the event of failures. Here we discuss the types of failures that can occur with RPC calls, closing with an example of how we built fault tolerance into our translation service.

Types of issues that can cause an RPC to fail

Interservice communication in microservices architectures can fail for a myriad of reasons, but categorizing these errors into larger groups, each with a common theme, helps in devising an appropriate generic solution for each. RPCs can fail because:

  • networks are unreliable
  • humans introduce bugs in the code
  • microservices depend on databases

We’ll dive into more details about each group below.

Unreliable networks

Networks in distributed systems are often unreliable. The errors we experience can be categorized into two large sub-groups: transient errors and persistent errors. Transient errors disappear when the same network request is retried; such errors might be caused by a spike in network traffic that causes the request to be dropped. Persistent errors, on the other hand, require external intervention to be resolved. Incorrect DNS configurations, for example, can make a microservice or even a whole cluster unreachable. It is important to account for these issues and plan to overcome them.

Application-level bugs

Bugs are a reality of software development; they can sneak through even the best testing suites and CI pipelines and disrupt RPCs. For instance, an RPC can fail because a bug unexpectedly interrupts the target service. In an architecture using distributed deployments, a bug might be contained within only one set of a microservice’s replicas. In such cases, retrying the request with a different recipient might resolve the problem. The deployment system could then roll back the affected services. But if the bug has made its way to all replicas, the RPC inevitably fails without an external intervention. Because bugs must be expected from time to time, the system should be designed to continue normal operations despite them.

Database-level errors

It’s common for the recipient of an RPC to reach out to its database to generate a response to the sender. This is true in our translation systems: DoorDash microservices reach out to the translation microservice to retrieve translated content, and the translation microservice reads those strings from its database to generate a response. Many issues can arise from the microservice’s dependency on the database; for instance, the connection might be having issues, or the database server might be down or under heavy load, causing the microservice to return an empty response or an exception. The client should be able to detect such errors and act appropriately to circumvent the issue.


Strategies to handle RPC failures

Multiple patterns can be employed to overcome RPC failures, enabling the system to continue operating in the face of outages. There are three key methods to recover from RPC errors: 

  • retrying a failed RPC
  • rerouting traffic from a faulty microservice to a healthy one
  • adding a fallback path to the RPC

Retrying failed RPCs

A common pattern for recovering from a failed RPC is simply to retry the call. As shown in Figure 1, transient errors are the group of failures where the retry mechanism shines. When implementing retries, however, one needs to be careful not to overload the network.

Figure 1: Retry pattern with exponential backoff

Consider a network experiencing a heavy load that drops an RPC for lack of resources. Immediately retrying the call not only returns another failure, but also adds more pressure to the already saturated network. Instead, deploy an exponential backoff algorithm: the wait time before the next call is doubled at every retry, giving the network an opportunity to recover instead of flooding it with more traffic.
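The backoff loop fits in a few lines. This is a minimal sketch for illustration, not DoorDash's actual client; the function name and parameters are assumptions, and a small random jitter is added so many clients retrying at once don't synchronize their traffic:

```python
import random
import time


def call_with_backoff(rpc, max_attempts=5, base_delay=0.1):
    """Retry a zero-argument RPC callable, doubling the wait each attempt."""
    for attempt in range(max_attempts):
        try:
            return rpc()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error to the caller
            # Exponential backoff: base_delay, 2x, 4x, ... plus jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

A call that fails transiently a couple of times then succeeds would be absorbed by the loop, while a persistent failure still surfaces after the final attempt.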

As helpful as retries can be, they aren’t the perfect solution to every problem. Retries can’t overcome persistent errors within the communication link, for example, so let’s look at the next strategy for improving fault tolerance. 

Error correction through replicas 

Creating multiple replicas of a microservice minimizes the chance of experiencing a single point of failure. As shown in Figure 2, a load balancer is set up between the requester and receiver. When the load balancer notices that one of the replicas is unresponsive, it can redirect traffic going into the faulty replica into healthy ones. 

Figure 2: Faulty and healthy microservice replicas

There may be many reasons why one replica experiences issues. As we saw previously, a bug may have been deployed to a group of replicas, but so long as there are still healthy replicas in the cluster we can minimize the impact. While container orchestration systems make it easy to set up multiple replicas for a service, replicas alone can't guarantee the architecture’s fault tolerance. Consider two examples: a bug might slip through undetected and be deployed to all instances of a service, or none of the instances might be able to access the database.
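The rerouting behavior can be illustrated with a toy balancer that stops sending traffic to a replica once it fails. Real load balancers use active health checks and richer policies; this sketch and all its names are purely illustrative:

```python
import itertools


class ReplicaBalancer:
    """Toy round-robin balancer that skips replicas marked unhealthy.

    `replicas` maps a replica name to a zero-argument callable; a replica
    that raises is marked unhealthy and its traffic goes to the rest.
    """

    def __init__(self, replicas):
        self.replicas = dict(replicas)
        self.healthy = set(replicas)
        self._cycle = itertools.cycle(sorted(replicas))

    def call(self):
        for _ in range(len(self.replicas)):
            name = next(self._cycle)
            if name not in self.healthy:
                continue  # redirect traffic away from the faulty replica
            try:
                return self.replicas[name]()
            except ConnectionError:
                self.healthy.discard(name)
        raise ConnectionError("no healthy replicas left")
```

With one faulty and one healthy replica, the first failed call marks the faulty replica and all subsequent calls land on the healthy one.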

Using fallbacks to respond to failures

When an RPC fails and none of the previous strategies works for recovery, fallbacks can save the day. In principle, the idea is simple: when an RPC’s primary path breaks, it falls back to a secondary path, as shown in Figure 3. Retries can’t recover from persistent communication link errors, so fallbacks use a different communication link. When none of the available microservice replicas can process an RPC, fallbacks fetch the result from another source. 

As great as fallbacks sound for handling distributed system errors, they do come with drawbacks. The fallback path usually offers reduced performance or quality compared to the primary path. Nonetheless, it works well as a last resort to avoid or resolve outages. For a user, degraded performance is better than the service not being available at all. In conjunction with the strategies described previously, fallbacks help create a strong, fault-tolerant architecture.
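The core of the pattern fits in a small wrapper. This sketch assumes both paths return responses of the same shape; the names are hypothetical:

```python
def call_with_fallback(primary, fallback):
    """Try the primary path first; on failure, serve from the fallback path.

    Both callables return a response in the same shape, so the caller never
    needs to know which path produced it (transparency), and the fallback
    must not share infrastructure with the primary (independence).
    """
    try:
        return primary()
    except ConnectionError:
        # Degraded but available: slightly stale data beats an outage.
        return fallback()
```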

Figure 3: Try fallback path when primary fails

Successful failure recovery requires the fallback path to be transparent to the sender and independent from the primary path:

  • Transparent to the sender: The responses from the primary and fallback paths should at least be similar, if not identical. The sender microservice need not be aware of how the results were fetched so long as the response to the RPC is correct. Implementation details of the fallback mechanism can be abstracted away and contained within the RPC client used by the sender, and the sender might tolerate slightly outdated data or decreased performance to maintain normal operation during outages.
  • Independent from the primary path: Fallback paths must be independent of primary paths so that a single outage cannot bring both down. Two independent systems working in parallel with respective reliabilities (and failure probabilities) of R1=90% (F1=10%) and R2=90% (F2=10%) have a combined reliability of R = 1 - F1*F2 = 99%, because the system as a whole fails only when both paths fail at once.
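The reliability arithmetic for two independent paths can be checked directly (the function name here is illustrative):

```python
def combined_reliability(primary_reliability, fallback_reliability):
    """Reliability of two independent paths: the system fails only when
    both fail, so R = 1 - F1 * F2 = 1 - (1 - R1) * (1 - R2)."""
    f1 = 1 - primary_reliability  # failure probability of the primary
    f2 = 1 - fallback_reliability  # failure probability of the fallback
    return 1 - f1 * f2


# Two 90%-reliable independent paths combine to 99% reliability.
print(combined_reliability(0.9, 0.9))
```

Note that this math holds only if the failures really are independent; a fallback sharing the primary's network or database gains far less.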

Improving fault tolerance in DoorDash translation systems 

DoorDash uses a microservices architecture that employs gRPC to communicate between services. Product microservices return content to clients in the user’s preferred language. The translated content is fetched from the platform’s translation microservice. A translation client simplifies the interaction with the translation systems, exposing a simplified interface that different teams can use to fetch translated content.

Many services of critical importance to our users depend on translation systems. In our three-sided marketplace, an order may be created, prepared, and delivered by people speaking different languages. It is important for us to accommodate all of our users by providing a multilingual app experience. In addition to our home base in the U.S., DoorDash operates in Australia, Canada, Germany, and Japan, requiring the app to support several different languages. To be successful, international operations require reliable and fault-tolerant translation systems.

In an effort to get 1% better every day, we continuously monitor our system to seek possible fault tolerance improvements. Prior to implementing fallbacks, we already were employing exponential backoff retries in our translation client and our translation service was configured to have multiple replicas to handle requests. But we didn’t think that was enough to guarantee that 100% of content appeared localized in the apps. We decided to invest in fallbacks for this critical path to ensure reliable content retrieval.

Implementing fallbacks into RPCs to translation systems

To ensure that we avoided issues with the translation service or its communication link, the fallback needed to fetch localized content from a different source. We decided to store the RPC responses as files in an object storage system. Localized content, which tends not to be updated frequently, usually is a large dataset, making it convenient to bundle into a file.

For the object storage system, we chose Amazon Simple Storage Service (S3). S3’s availability is independent from the translation service, which meant that an outage on one wouldn’t affect the other. S3 is also an industry standard solution with high guarantees for scalability, data availability, security, and performance. Moreover, because S3 is employed for many different use cases within our architecture, DoorDash already had strong knowledge and experience with it.

Figure 4: Periodic job to create backup files

The S3 bucket contains a file for each feature that needs localized content. To populate the bucket, we created a periodic job within the translation service, as shown in Figure 4. The job generates the RPC response that the translation client would otherwise receive and writes that data into a file. Once the file is created, it gets uploaded to the bucket. 
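The core step of the periodic job can be sketched as follows, with JSON and a local directory standing in for protocol buffers and S3 so the example stays self-contained; all names here are illustrative, not DoorDash's actual code:

```python
import json
from pathlib import Path


def write_backup_file(feature, response, bucket_dir):
    """Serialize the RPC response a client would receive and write it as
    one backup file per feature, mirroring what the periodic job uploads."""
    path = Path(bucket_dir) / f"{feature}.json"
    path.write_text(json.dumps(response))
    return path
```

In production the serialized bytes would then be uploaded to the S3 bucket rather than kept on local disk.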

Product microservices within DoorDash’s architecture periodically request localized data via the translation client. In normal operation, the translation service reads this data from its database and returns it as the response. The database is continuously updated with new content for each feature.

To prevent the backup files from becoming outdated, the periodic job in the translation service needs to run more often than the translation client requests localized content. Some services might run the translation client more often than others under certain conditions, so there is a chance for the retrieved S3 file to be slightly outdated. This is an acceptable tradeoff compared to having no localized content at all. We have also enabled versioning in our S3 bucket so that new runs of the periodic job upload a new version of the files. In the event of issues, it is easy to revert to previous versions. To avoid wasting storage space, older versions that are no longer needed are deleted periodically according to lifecycle rules.

The files uploaded to S3 by the translation service had to be written in an easily digestible format. The gRPC responses from the translation service were already serialized with protocol buffers before being sent over the wire, so for our use case it made sense to serialize the data in the backup files with protocol buffers as well. The same characteristics that benefit our gRPC communication apply to the backup files: protocol buffers are language-agnostic and fast to process, and the encoded data is smaller than in other formats, making protobufs ideal for our high-scale, multi-language backend architecture.

Figure 5: Translation client before and after adding a fallback

After we built the source for our fallback path, we were ready to add fallbacks to the translation client RPCs. If the RPC to retrieve localization content failed, despite retries and the high-availability architecture of the translation service, the translation client now falls back to pulling the localized content for the requested feature from S3, as shown in Figure 5. Once the raw data is received by the client, it is deserialized with protobufs in the same manner as the normal gRPC response. During an outage, the data pulled from S3 is either identical or, at worst, slightly behind the data that would have come from the translation service directly. Because S3 availability is independent from the translation service’s availability, it aligns with the two characteristics of a successful fallback.

It’s crucial to react quickly during incidents. We updated the client so that we can configure primary and fallback paths on the fly. During a degradation of the primary path, we might decide to promote the fallback path as primary and vice versa. Using DoorDash’s distributed configuration system, we can update this configuration for all microservices or only selected ones without needing to redeploy the affected services. This enables us to react promptly and with confidence to more long-lasting failures.
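Such configuration-driven path selection can be sketched as a client that tries paths in whatever order the current configuration dictates. The configuration keys and function names are assumptions for illustration, not DoorDash's actual API:

```python
def fetch_localized_content(feature, paths, config):
    """Try each content path in the configured order, so operators can
    promote the fallback to primary (and back) without a redeploy."""
    order = config.get("path_order", ["translation_service", "s3_backup"])
    last_error = None
    for name in order:
        try:
            return paths[name](feature)
        except ConnectionError as err:
            last_error = err  # try the next configured path
    raise last_error
```

Updating `path_order` in the distributed configuration system then changes which path is attempted first, for all services or only selected ones.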

Conclusion: Embrace failures in distributed systems

Fault-tolerant architectures embrace failures as the norm. In high-scale architectures, it’s a recipe for disaster to plan around avoiding failures. With all the variables that go into a large system, including hardware and software unreliability as well as adding the human factor to the equation, errors are inevitable.

With the myriad strategies described here to mitigate failures, it is possible to keep systems operational despite them. The fallbacks we have deployed to improve our system’s fault tolerance are just one of the many initiatives DoorDash undertakes to provide our customers the best service possible.