Financial partnerships are tricky to manage, which is why DoorDash needed the right technology stack to quickly onboard new DashPass partners. The challenge was that each partner brought a diverse set of conditions and rules that our system needed to accommodate without skipping a beat. To carry out these integrations quickly, we needed a technology stack that would let us manage every partner's requirements and onboard each partner to the platform in a timely manner.
After a thorough review of the leading task processing technologies, we chose Cadence as the task processing engine and followed the separation of concerns (SoC) design principle to gain reliability and visibility and to encapsulate implementation details. Below we explain the challenges of ensuring faster DashPass partner integrations and how our technology review led us to select Cadence as the best technology for speeding up integrations.
Background: How DashPass partnerships work
DashPass partners with several banks, including Chase and RBC, to offer credit card customers a free DashPass for limited periods of time. To provide this benefit, each partner must be integrated into the DashPass system for card eligibility checks and reconciliation. But integrating each partner with our systems took an extended amount of time (the RBC integration took several quarters) because of a variety of challenges, including:
- Different business requirements for each financial partner
- Varying synchronous and asynchronous reconciliation processes
- Race conditions resulting in data corruption and unexpected behavior
- Unclear ownership that reduces overall engineering efficiency and creates confusion in team collaboration
We were able to resolve each of these challenges by building a more coherent platform that speeds up the onboarding process considerably.
Challenge 1: Integration logic varies between financial institutions
Each partner has established different rules around how customers will be allowed to enjoy the DashPass benefit. These differences can relate to factors such as how long the customer gets the benefit and when the benefit kicks in or lapses.
Such complexities lead to multiple branches in the decision tree, causing our code base to grow more complex as more partners come on board. If we fail to build solutions to contend with this branching, our code becomes more difficult to read, maintain, and scale.
Challenge 2: Each institution handles reconciliation differently
Reconciliation is an essential part of dealing with transactions on cards we cannot yet verify, a case known as multi-match. But each institution deals with reconciliation differently. For example, some conduct reconciliation synchronously, while others require asynchronous reconciliation over multiple days. To enable a good user experience in multi-match cases, we may have to compensate after a certain period of time has passed.
Challenge 3: Lack of visibility, reliability, and control
The workflow of claiming DashPass benefits involves multiple steps and branches. Without a mechanism to control what happens at each step, it is difficult to retry failed steps, gain visibility into how far the customer has progressed, and recover from infrastructure failures (for example, “fire and forget” coroutines can be lost) and server timeouts.
Challenge 4: Race conditions and idempotency
Write requests can take some time in certain cases, causing the client to retry, which can result in data corruption because two write requests then exist for the same user and the same operation. For example, we use Redis locks for a few endpoints like “subscribe” to protect against users receiving two active subscriptions, but this is not an ideal solution.
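The duplicate-write problem can be illustrated with a small sketch. It is written in Python for brevity and is not DoorDash's code: `SubscriptionStore`, its `subscribe` method, and the key format are all hypothetical. It shows one common guard, deriving a deterministic key from the user and the operation so that a retried request returns the first result instead of writing a second time.

```python
import threading
from typing import Dict


class SubscriptionStore:
    """Hypothetical in-memory store; stands in for a real database."""

    def __init__(self) -> None:
        # In-process lock only makes the check-and-set below atomic in this toy.
        self._lock = threading.Lock()
        self._results: Dict[str, int] = {}  # idempotency key -> subscription id
        self._next_id = 1

    def subscribe(self, user_id: str, operation: str) -> int:
        # The key is deterministic, so a client retry maps to the same entry.
        key = f"{operation}:{user_id}"
        with self._lock:
            if key in self._results:
                # Retry of a request that already succeeded: no second write.
                return self._results[key]
            subscription_id = self._next_id
            self._next_id += 1
            self._results[key] = subscription_id
            return subscription_id


store = SubscriptionStore()
first = store.subscribe("user-42", "subscribe")
retry = store.subscribe("user-42", "subscribe")  # client retried after a timeout
other = store.subscribe("user-7", "subscribe")
```

Here `first` and `retry` refer to the same subscription, while `other` gets its own. A real system would need the same check-and-set to be atomic across servers, which is the part a process-local lock cannot provide.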
Challenge 5: No clear ownership separation
The DashPass backend evolved organically as a line-for-line rewrite of DSJ, our legacy Django monolith application. Multiple teams have subsequently worked on DSJ without a clear separation of concerns. The core business logic flow, which intercepts payment methods being added and creates links that make users eligible for a partnership DashPass, is intermingled with integration logic specific to particular partners.
This highly imperative code hurts our development velocity and operational excellence. Debugging regressions and supporting customers can become time-consuming because of limited observability. Because it is hard for developers from other teams to build new integrations, team collaboration becomes complicated and it is easy to introduce bugs. We use Kotlin coroutines spawned from the main gRPC requests to drive much of the logic, but that approach is both error-prone (the gRPC server can die at any moment) and hard to debug.
Key objectives to achieve with improved integrations
In addition to resolving complexity issues, improving visibility, reducing potential infrastructure failure, centralizing control, and clarifying ownership separation, we are pursuing several key objectives with the DashPass partner integration platform, including:
- Reducing the engineering time and complexity in onboarding new partners
- Introducing an interface that assigns core flow to the platform team and institution-specific integration logic to collaborating teams, allowing them to implement a well-defined interface to integrate a new DashPass partner while minimizing effort and the surface area for regressions
- Gaining visibility into what step each customer has reached as they progress alongside related customer information, card information, and financial response information
- Making the partner subscription flow immune to infrastructure failures by allowing the server to recover and retry at the last successful step after interruptions
- Creating centralized control of the workflow to allow querying, history look-up, and analysis of previous behavior
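The recovery and visibility objectives above can be sketched in a few lines. This is a toy model under stated assumptions, not how Cadence is implemented: `run_workflow`, the step names, and the `history` dictionary (standing in for durable workflow history) are hypothetical. The idea is that each step's result is recorded only after it succeeds, so a rerun after an interruption skips completed steps and resumes at the one that failed.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_workflow(steps: List[Step], history: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order, skipping any whose result is already in history."""
    for name, action in steps:
        if name in history:
            continue  # completed before the interruption; do not repeat it
        history[name] = action()  # recorded only if the action did not raise
    return history


calls: List[str] = []
crash_once = {"armed": True}


def validate_card() -> str:
    calls.append("validate_card")
    return "valid"


def call_partner() -> str:
    calls.append("call_partner")
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise ConnectionError("simulated server interruption")
    return "eligible"


steps = [("validate_card", validate_card), ("call_partner", call_partner)]
history: Dict[str, str] = {}  # stands in for durable workflow history

try:
    run_workflow(steps, history)
except ConnectionError:
    pass  # the "server" died mid-workflow

run_workflow(steps, history)  # recovery: resumes at call_partner
```

After the second run, `validate_card` has executed once and `call_partner` twice, and `history` doubles as a record of which step each customer has reached.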
Our solution is to build a platform with flexible workflows to allow fast integration of future financial partners. There are, however, many choices of technology stack for workflow management. Here is an overview of our technology selection process and why we ultimately chose Cadence.
Selecting the right technology stack
Among the technology stacks we considered were Cadence, Netflix Conductor, AWS Step Functions, and an in-house solution built on Kafka and Postgres. To assess the choices, we considered the following features:
- Language used in the client library.
- Ease-of-use in implementing our codebase and whether we needed to change our infrastructure to accommodate features.
- Easy querying in both synchronous and asynchronous workflow states.
- Easy look-ups to search workflows based on, for example, customer ID.
- Historical check to verify results.
- Testable to confirm integrations.
- Backwards compatibility to support future workflow changes.
- Logging/monitoring and the ease of setting them up.
- High performance in the face of additional layers of complexity.
- Reliability in the event of failure, including allowing server-side retries following recovery.
Our technology review
Ultimately, we took deep dives into four options for our technology stack: Cadence, Netflix Conductor, AWS Step Functions, and building an in-house solution.
Cadence made it onto our shortlist because it is flexible, easy to integrate, and provides unique workflow IDs that address our use case.
Pros:
- Easy and fast to integrate
- Open source, so no infrastructure restrictions
- Guarantees exactly-once job execution: a job with a given unique ID cannot be executed concurrently, which solves race conditions that currently require locks
- Allows failed jobs to retry, creating a valuable recovery mechanism
- Provides a way to wait for job completion and result retrieval
- Client libraries for popular languages are already available
- Small performance penalties
- Scales horizontally with ease
- Supports multi-region availability
- Offers thorough documentation and is already familiar to our team
- No reliance on specific infrastructure
- No limits on workflow and execution duration
- Easy search function
- Simplified test setup for integration tests
Cons:
- Configuration not as flexible as an in-house solution
- Long-lived actors are consciously thrown out for backward compatibility
- History storage must be done manually, limiting search
Netflix Conductor came highly recommended because it supports many languages, is production-tested, is open source, and is widely used.
Pros:
- Open source, so no infrastructure restrictions
- Supports Java and Python clients
- Supports parallel task executions
- Supports reset of tasks
Cons:
- DSL-based workflow definition, while starting simple, can become complicated as the workflow becomes more complex
An in-house solution
While selecting an open source technology was certainly an option, we could also build something ourselves (for example, on top of Kafka and Postgres).
Pros:
- We dictate the workflow control mechanism
- Allows implementation of TCC (try-confirm/cancel) instead of the saga pattern for transaction compensation
Cons:
- Building an in-house solution requires significant engineering effort
- Extra complexity because a message queue solution would have to poll for result completion
AWS Step Functions
AWS Step Functions was added to our shortlist because it also provides workflow solutions with failure retries and observability.
Pros:
- Offers a Java client library
- Provides a retry mechanism for each step
Cons:
- Tight throttling limits
- Requires infrastructure change, engendering extra work
- Difficult integration testing
- Offers a state machine/flowchart model instead of procedural code
- Inflexible tagging limits Elasticsearch-based search
Why we chose Cadence to power our workflows
Ultimately, we chose Cadence because of its flexibility, easy scaling, visibility, fast iterations, small performance penalty, and retry mechanism. Unlike AWS Step Functions or a similar DSL-based approach, Cadence allows flexible development of a complex workflow. In addition to allowing synchronous waiting for job completions, Cadence scales well and is available across multiple regions.
Workflow visibility is key to solving customer issues, and Cadence’s Elasticsearch support provides it. Additionally, easy integration tests through the Cadence client library allow fast iteration and give us confidence in our code. With a low round-trip time for a synchronous workflow (p50 of 30 ms and p99 of 50 ms), Cadence adds little latency. To avoid overloading downstream services during downtime, Cadence provides easy configuration for retries and exponential backoff.
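To make the retry behavior concrete, here is a minimal sketch of how an exponential backoff schedule is typically computed. It does not call Cadence: the function `backoff_schedule` and its parameter names are assumptions chosen for illustration, loosely resembling the knobs most retry policies expose (an initial interval, a multiplier, a cap, and an attempt limit).

```python
from typing import List


def backoff_schedule(
    initial_seconds: float,
    coefficient: float,
    max_interval_seconds: float,
    max_attempts: int,
) -> List[float]:
    """Return the wait before each retry (attempt 1 is the original call)."""
    delays: List[float] = []
    delay = initial_seconds
    for _ in range(max_attempts - 1):
        # Each wait grows by the coefficient but never exceeds the cap.
        delays.append(min(delay, max_interval_seconds))
        delay *= coefficient
    return delays


schedule = backoff_schedule(
    initial_seconds=1.0,
    coefficient=2.0,
    max_interval_seconds=10.0,
    max_attempts=6,
)
# schedule == [1.0, 2.0, 4.0, 8.0, 10.0]
```

The widening gaps are what keep a recovering downstream service from being hit by a burst of immediate retries, and the cap keeps the longest wait bounded.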
Setting up Cadence workflows
Our sample workflow links customer credit cards to DashPass partners. When a customer adds a card, the card’s information is validated against payment method eligibility, which includes checks such as country validity and the bank identification number (BIN). If eligibility checks are successful, the partner API is called for further eligibility checks, whose results are sorted as eligible, not eligible, or multi-match. Multi-match, in particular, triggers a fallback check as a follow-up action.
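The branching described above can be sketched as a plain function. Everything here is illustrative rather than taken from the production code: the `Eligibility` enum, `link_card`, the country and BIN sets, and the returned status strings are hypothetical names and values.

```python
from enum import Enum
from typing import Callable, Dict, Set


class Eligibility(Enum):
    ELIGIBLE = "eligible"
    NOT_ELIGIBLE = "not_eligible"
    MULTI_MATCH = "multi_match"


# Hypothetical local rules: supported countries and partner BINs.
SUPPORTED_COUNTRIES: Set[str] = {"US", "CA"}
PARTNER_BINS: Set[str] = {"411111", "522222"}

Card = Dict[str, str]


def passes_local_checks(card: Card) -> bool:
    return card["country"] in SUPPORTED_COUNTRIES and card["bin"] in PARTNER_BINS


def link_card(card: Card, partner_check: Callable[[Card], Eligibility]) -> str:
    if not passes_local_checks(card):
        return "rejected_locally"  # the partner API is never called
    outcome = partner_check(card)
    if outcome is Eligibility.ELIGIBLE:
        return "linked"
    if outcome is Eligibility.MULTI_MATCH:
        return "fallback_check_scheduled"  # follow-up action for multi-match
    return "not_linked"


card = {"country": "US", "bin": "411111"}
linked = link_card(card, lambda c: Eligibility.ELIGIBLE)
pending = link_card(card, lambda c: Eligibility.MULTI_MATCH)
rejected = link_card({"country": "FR", "bin": "411111"}, lambda c: Eligibility.ELIGIBLE)
```

Passing the partner check in as a callable keeps the core branching identical for every partner, which is the property the rest of this section relies on.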
In Figure 1, the workflow is diagrammed with green boxes indicating where specific integrations can deviate and where core flow can call out to the corresponding integration. Core flow receives integrations via Guice injection, asks each for eligibility checks, and follows up accordingly. Eligibility checks are included in the Cadence activity. The Cadence activity will call out to the partner implementation accordingly. If fallback checks are required, a separate workflow will be spun up.
We set up integration tests that exercise both the old paths and the new paths (which use Cadence) to verify that they produce the same outputs, meaning the same gRPC response and the same database creations and updates.
We also shadow the new path during rollout to validate the process. In shadow mode, gRPC outputs are compared asynchronously, and dry-run mode is enabled: the new path simulates creating a membership link (meaning the credit card has been linked to a financial institution successfully) in the subscription database, and we check whether the result matches the original one.
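A simplified sketch of the shadowing idea follows. It is not the production implementation: `old_path`, `new_path`, `handle_with_shadow`, and the payload shape are hypothetical, and the comparison runs inline here for readability, whereas the text above describes it as asynchronous.

```python
from typing import Dict, List

Payload = Dict[str, str]
writes: List[Payload] = []  # stands in for the subscription database


def old_path(request: Payload) -> Payload:
    return {"user": request["user"], "link": "created"}


def new_path(request: Payload, dry_run: bool) -> Payload:
    # Deliberate discrepancy for user "u2" so the comparison has something to catch.
    status = "skipped" if request["user"] == "u2" else "created"
    result = {"user": request["user"], "link": status}
    if not dry_run:
        writes.append(result)  # a real run would persist the membership link
    return result


def handle_with_shadow(request: Payload, mismatches: List[Dict[str, Payload]]) -> Payload:
    served = old_path(request)  # the caller always gets the old path's answer
    try:
        shadow = new_path(request, dry_run=True)
        if shadow != served:
            mismatches.append({"old": served, "new": shadow})
    except Exception as error:  # a shadow failure must not affect the caller
        mismatches.append({"old": served, "new": {"error": str(error)}})
    return served


mismatches: List[Dict[str, Payload]] = []
r1 = handle_with_shadow({"user": "u1"}, mismatches)
r2 = handle_with_shadow({"user": "u2"}, mismatches)
```

Both callers receive the old path's response, one mismatch is recorded for `u2`, and `writes` stays empty because the new path only ever runs in dry-run mode.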
It is also worth mentioning that this design decouples core flow from plan integrations, following the separation of concerns pattern. We have developed interfaces for new partners that abstract away implementation details, which are represented by the green boxes in Figure 1. Core flow calls into the particular implementation's methods to execute the green box logic. Integrations are injected into core flow at startup time using Guice dependency injection.
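The shape of this separation can be sketched as follows. The production system uses Guice on the JVM. This Python version only mirrors the idea with constructor injection, and every name in it (`PartnerIntegration`, `BankA`, `BankB`, `CoreFlow`, the rules, and the durations) is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict


class PartnerIntegration(ABC):
    """Hypothetical interface that a partner team would implement."""

    @abstractmethod
    def benefit_days(self) -> int:
        """How long the free DashPass lasts for this partner."""

    @abstractmethod
    def is_eligible(self, card_bin: str) -> bool:
        """Partner-specific eligibility rule."""


class BankA(PartnerIntegration):
    def benefit_days(self) -> int:
        return 90

    def is_eligible(self, card_bin: str) -> bool:
        return card_bin.startswith("4")


class BankB(PartnerIntegration):
    def benefit_days(self) -> int:
        return 365

    def is_eligible(self, card_bin: str) -> bool:
        return card_bin in {"522222", "533333"}


class CoreFlow:
    """Owns the shared flow; knows partners only through the interface."""

    def __init__(self, integrations: Dict[str, PartnerIntegration]) -> None:
        self._integrations = integrations  # injected once at startup

    def link(self, partner: str, card_bin: str) -> str:
        integration = self._integrations[partner]
        if not integration.is_eligible(card_bin):
            return "not eligible"
        return f"linked for {integration.benefit_days()} days"


core_flow = CoreFlow({"bank_a": BankA(), "bank_b": BankB()})
result_a = core_flow.link("bank_a", "411111")
result_b = core_flow.link("bank_b", "411111")
```

Adding a partner in this model means writing one new class and registering it, without touching `CoreFlow`, which is the property that keeps the surface area for regressions small.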
In the months since rollout, there have been no major issues. While integrating RBC took several quarters before we introduced the new integration platform, our integration of Afterpay following the platform’s rollout was completed within a single quarter. Under the new process, creating a link partner integration requires only one or two pull requests. Additionally, Cadence has allowed us to address ownership separation concerns and speed up future integrations.
Cadence Workflow could be used to resolve similar visibility and reliability challenges in other situations with minimal effort. Among its benefits are increased visibility into workflow and activity, a simplified retry process, free locks via workflow ID, and guaranteed exactly-once-per-task execution.