Building Reliable Workflows: Cadence as a Fallback for Event-Driven Processing

author:

Building Reliable Workflows: Cadence as a Fallback for Event-Driven Processing

Amid the hypergrowth of DoorDash’s business, we found the need to reengineer our platform, extracting business lines from a Python-based monolith to a microservices-based architecture in order to meet our scalability and reliability needs. One such line, our white-label delivery fulfillment business Drive, moved to a Kotlin-based service. For DoorDash Drive, reliability not only means supporting the businesses that we partner with, but also ensuring that customers get their deliveries and Dashers have a steady source of income.

As a logistics service, DoorDash inherently relies on event-driven processes. Being reliable at DoorDash Drive means ensuring our delivery creation flow, a chunk of processing steps that need to occur in order for a physical delivery to take place, is both resilient and redundant. In our monolith, we implemented this processing both synchronously and asynchronously, depending on the use case of the specific Drive order. 

As a part of the transition effort, we chose Cadence, an open source orchestration engine, to manage some of Drive’s asynchronous business logic. 

However, instead of a wholesale move to Cadence, we treated it as a stepping stone, instantiating it on a single Cassandra cluster as a fallback mechanism for Drive’s primary delivery creation flow. This choice lets us continue to support the Drive business line with a trusted, reliable flow while providing the capability of expanding our Cadence footprint as business needs and reliability dictate.

Figure 1: Order creation occurs in DSJ, our code monolith, using Redis to broker requests with Celery tasks, which retries as necessary. DoorDash is transitioning from a monolithic architecture to microservices.

The challenges of moving from a monolith to microservices

Before moving to a separate microservice, we used Celery tasks to handle our asynchronous task retries. Not only is Celery exclusive to Python, but it relies on a memory-based broker such as Redis or RabbitMQ. This means we would need to scale the memory, or, in the case of Redis, shard/re-shard keys as we scale. In contrast, Cadence provides a bring-your-own Cassandra solution, which allows us to store large, more complex data that would not perform optimally on a key-value store such as Redis. Additionally, Cassandra handles events consistently and is highly scalable both horizontally and vertically.

As we continue to scale, these monolithic design practices no longer keep up with our needs, especially since traffic at DoorDash undergoes large, sustained peaks (people tend to order food during certain times, like on a Friday night). The synchronous tasks are not performant because, on the new microservice architecture, the delivery flow involves network calls to other microservices to perform the same tasks. For this reason, we preferred asynchronous tasks, but these can be difficult to manage and work with, and non-transparent failures can lead to lost deliveries and ultimately a bad customer experience.

Our use case can be further broken down into three parts: precreate/processing, delivery creation, and postcreate/processing. These parts all involve processing steps as well as different RPC calls to internal services that provide us with essential information, such as customer ETA estimates, payment information, and fee calculations. Given the non-trivial latencies from processing and service calls that this flow entails, moving to a fully asynchronous flow is ideal.

Workflow orchestration with Cadence

Cadence’s official documentation refers to it as a “fault-oblivious stateful programming model”. As an orchestration engine developed and open-sourced by Uber, Cadence offers many nice features for managing asynchronous tasks, including automatic retries, fault tolerance, and reliability. 

Traditionally, writing scalable, reliable, and distributed code results in complicated business logic that is difficult to service and maintain, requiring hours of researching and discussing different architecture solutions for each specific use case. Cadence addresses this issue by abstracting away many of the resource limitations one would generally run into: it provides durable virtual memory that preserves the application state even in the event of failures.

Given these factors, Cadence seems to meet all of our needs in scale and resiliency in the move to a microservices-based architecture. However, Cadence is still in early adoption at DoorDash, so we did not want to rely on it for such an integral flow such as the primary delivery creation for our white-label service. In our initial deployment, we only have one Cassandra cluster dedicated to Cadence, but can scale it as our use of Cadence expands.

A scaffolding approach

Our solution to this problem was to essentially use Cadence as a fallback behind our primary delivery creation flow. Not only would this serve as a stepping stone for a full Cadence delivery creation flow, it would also instill confidence for ramping up on the integration of Cadence at DoorDash for the future in a manner that manages risk. We chose Cadence over other approaches here because of three primary factors:

  1. It is highly reliable as it preserves workflow state in the event of almost any failure.
  2. It is scalable and very flexible: we can have a few long-running workflows (think weeks and months!) that perform tasks periodically to thousands and thousands of workflows running at any given instant.
  3. Through task management abstractions, we expect to improve developer productivity, allowing us to move even faster.
Figure 2: In our delivery creation flow, once an order is created, it gets published on a Kafka topic which is read by workers on Drive service to handle delivery creation. In the case of any failures, the handler will schedule the retry workflow on Cadence. 

In coming up with a solution to adopt Cadence, we wanted two things:

  1. Primary reliance on a well tested and reliable system at DoorDash.
  2. External transparency into deliveries that are being handled by the Cadence worker.

A simplified flow, shown in Figure 2, above, can be described as follows:

  1. A customer places an order at one of our Drive partners.
    1. The Partner, integrated with our API, calls our service.
  2. The order gets created, then publishes the delivery creation payload as a message to the Apache Kafka topic.
  3. Drive service consumes this event through the Kafka topic and deserializes the message into an object. This object then gets picked up by a coroutine that runs asynchronously.
    1. At the same time, the payload and ID are written as a record to our Postgres table. It gets marked as “in-progress” to indicate that we have not yet completed this task.
  4. If the initial delivery creation fails, we catch and kick off a new Cadence workflow with the delivery creation task.
    1. Retries are automatically re-run according to a simple policy, as illustrated in Figure 3, below:
Figure 3: A simplified implementation of the retry policy. In practice, we use runtime configurable dynamic values to be able to change these parameters in real time.
  1. In the event of a serious outage, we don’t want to leave the customer’s order in limbo. Orders that have repeatedly failed in Cadence will run a separate cancellation task.
  2. Upon completion, we update the corresponding record in the Postgres table with the appropriate status, e.g. “Completed” upon completion.

Writing the workflow

There was one big challenge when designing Cadence workflows — not all of our business logic/service calls were idempotent. This idempotency requirement goes hand-in-hand with any policy — it would be very problematic if, for every order, we created multiple deliveries, for example. To address this, we came up with three possible approaches:

  1. Make the non-idempotent idempotent. 

In an ideal world, we could contact certain non-idempotent services we were making calls to and see if they could implement some sort of idempotency key for us. In a fast-moving environment with many competing priorities, this is just not feasible as developer bandwidth is limited. The other option would be to write a getOrCreate wrapper ourselves, but this is also not possible if there is not a unique key.

  1. Create child workflows/activities at the level of the non-idempotent parts.

This approach utilizes the application state guarantees that we get from Cadence. To Cadence, activities are all the same; it’s an external call to the outside world that doesn’t operate under the same guarantees that Cadence can. However, Cadence still sees activities as functions with a return, so it can determine if it has completed successfully or not. As long as we are careful with coordinating the activity timeouts to equal or exceed our service timeouts, there should be no problems with idempotency at that level.

  1. Manually save a context to track non-idempotent components that have been completed.

This approach involves using a delivery creation context to save the state of the non-idempotent parts. This way, upon failure, we can check the state of non-idempotent parts and skip these in the retry.

We chose the third approach, as it gives us visibility into which components have been completed. Additionally, as we are using Cadence as a fallback mechanism, we would need to save the context of the initial delivery creation flow anyways. In the future, we plan on moving to option two when we fully adopt Cadence, taking advantage of the built-in activity execution monitoring in Cadence.

Performance gains with Cadence

Moving to a fully asynchronous-based event driven processing system will in our case result in big performance gains as we move off a synchronous processing model for one of our core flows. Eventually, through the transitioning process of having Cadence serve as a fallback option to a critical component, we will eventually build enough confidence in the platform to use it solely for our primary delivery creation flow. This will bring us massive gains in developer productivity due to its ease of use and abstraction of fussy details.

For any service-oriented architecture, Cadence will prove invaluable in being able to perform event-driven and scheduled processing in a safe, reliable, and performant manner. In general, we believe Cadence is a promising platform that can be adopted for a wide variety of use cases, such as asynchronous processing and distributed cron. However, many developers may be hesitant to replace their current solutions with Cadence straight into production as it is still relatively new. For cases such as these, we think it worth building Cadence into less production-critical flows such as fallback, like we did at DoorDash, as it promises enormous potential benefits.

Alan Lin joined DoorDash for our 2020 Summer engineering internship program. DoorDash engineering interns integrate with teams to learn collaboration and deployment skills not generally taught in the classroom. 

Header photo by Martin Adams on Unsplash.

Alan Lin is a rising senior studying Computer Science at Cornell. In the summer of 2020, he was an intern on DoorDash's Drive Platform team.