As DoorDash’s business has grown, through increasing order volumes and emerging businesses such as grocery delivery, our customer support experience has needed to scale efficiently as well. The legacy support application that DoorDash built to issue credits and refunds was designed only for the original food delivery service and couldn’t handle the needs of our new verticals.
We needed a more scalable and automated means of distributing credits and refunds when customer experiences failed to meet our quality guarantees. These new requirements could not realistically be added to our legacy credits and refunds service, which was written in Python, our legacy backend language. Instead, we pursued a complete redesign, building a no-code platform and migrating the service to Kotlin, our new backend language.
In this article, we will walk through how we rewrote the service to solve the challenges inherent in issuing credits and refunds for new verticals. We also discuss how we migrated the system to Kotlin.
Redesigning the customer support application
Our legacy customer support application included a web-based tool we call the configurator, which allows customer service agents to associate a customer issue with a corresponding resolution strategy. The criteria for selecting resolution strategies, however, were defined in code. With the new customer support platform, we wanted to provide fast, accurate resolutions at relatively low cost. We narrowed the upgrade down to two possible approaches.
Our first approach would involve continuing to leverage the legacy architecture. Because our engineers developed the legacy application and have a thorough knowledge of it, composing new resolution strategies for the legacy system would be straightforward, streamlining the development work. The tradeoff inherent to this approach, however, is that we would have to rely heavily on the engineers to make continuous code changes to support the customer support team as they adjusted resolution strategies and ran optimization experiments.
Alternatively, we could move the code-defined resolution strategies and experimentation capabilities outside the codebase to make them configurable by non-engineers. A configuration-driven, no-code solution would reduce the reliance on engineers and enable our operators to move faster because they could translate resolution strategies into configurations in a “what you see is what you get” manner. To ensure that this solution would scale with our future business needs, we decided to use a decision tree configuration. A decision tree removes ambiguity because each path from the root to a leaf represents exactly one resolution strategy. When a new strategy is introduced, the configuration can be extended simply by adding a new branch to the tree. The downside of this approach is that it requires more up-front investment than extending the existing code. We would also need to train operators to define strategies and experiments using the self-serve configurator.
Ultimately, we decided that the second approach’s benefits outweighed its drawbacks. Consequently, we elected to redesign the basic customer support application into a configurable no-code platform that can support fast changes and experimentation.
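To make the decision tree idea concrete, here is a minimal sketch in Python. The node types, issue labels, and strategy names are hypothetical and chosen only for illustration; they are not the configurator's actual schema. The sketch shows that each root-to-leaf path yields one strategy and that adding a strategy is a data change rather than a code change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    """End of a path: names the resolution strategy to apply (hypothetical names)."""
    strategy: str


@dataclass
class Decision:
    """Inner node: asks a question about the context and follows the matching branch."""
    question: Callable[[dict], str]
    branches: Dict[str, "Node"] = field(default_factory=dict)
    default: str = "escalate_to_agent"  # assumed fallback when no branch matches


Node = Union[Leaf, Decision]


def resolve(node: Node, context: dict) -> str:
    """Walk from the root to a leaf and return that path's strategy."""
    while isinstance(node, Decision):
        child = node.branches.get(node.question(context))
        if child is None:
            return node.default
        node = child
    return node.strategy


tree = Decision(
    question=lambda ctx: ctx["issue"],
    branches={
        "missing_item": Leaf("refund_item"),
        "late_delivery": Leaf("issue_credit"),
    },
)

# Introducing a new strategy only adds a branch; resolve() is unchanged.
tree.branches["wrong_order"] = Leaf("full_refund")

print(resolve(tree, {"issue": "wrong_order"}))
```

Because the traversal logic never changes, an operator-facing editor only has to produce the tree data.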
Migrating the system from Python to Kotlin
Our first option was to implement the "componentization" of credit and refund strategies in the legacy Python codebase while we worked to spin up a new Kotlin service in parallel. In other words, we could continue to improve the existing application to meet current business needs, a relatively low-cost action, and tackle system migration as a separate effort. A fast-growing business requires fast delivery of technical solutions, and this option would provide it. The tradeoff is that we would not be addressing the underlying technical issue: we would keep writing code that adds to our tech debt and that we would eventually need to deprecate. On the other hand, tackling a large migration effort in a fast-growing environment creates a risk of significant business disruption.
An alternative option would be to stop building new code in our existing Python application to focus exclusively on spinning up our new Kotlin service. As new business requests arose, we could implement those requests in the Kotlin service. This approach would offer the advantage of not building tech debt while steadily migrating code without significant disruption to the business. A key drawback: We would have a hybrid state with both the legacy system and the new system involved. Both systems would have to be maintained and monitored for a longer period of time.
However, one additional factor tipped the balance in favor of Kotlin. DoorDash Engineering’s decision to use Kotlin as its microservices programming language meant that our new service would be operating fully inside DoorDash’s tech ecosystem and infrastructure.
We chose that second approach and created a new customer support platform in Kotlin while gradually migrating the legacy application over. We believe this path provides the best chance for initiating and completing the systems migration without significantly disrupting our business.
Making the credit and refund strategies configurable
After we created a new Kotlin service, we defined gRPC endpoints to create and read a credit and refund strategy. Our biggest redesign effort revolved around implementing a credit and refund configurator to allow operators to create credit and refund strategy decision trees (see Figure 1). The configurator required a visual editor for arranging credit and refund decision trees using a drag-and-drop mechanism, plus APIs to store and fetch the trees' configuration data. In addition to the visual editor, we needed a framework to parse the configuration data and execute the actions that the tree specified, and client services needed an API to invoke this execution framework. Building all of these capabilities from scratch would have been prohibitively expensive in both time and cost.
Fortunately, we already had a homegrown decision tree-based configuration platform to configure business and technical flows without code. We leveraged this existing workflow platform to store and fetch configuration data for credit and refund strategies. To help operators define those strategies, we also added special types of nodes that could only be understood by the credit and refund platform. For example, the is_vertical_id_in_list node in Figure 1 checks the business vertical. It tells the next node if the order is, for example, a restaurant order, a grocery order, an alcohol order, or a pharmacy order. Based on the output of the is_vertical_id_in_list node, there would be different credit and refund strategies.
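As a rough illustration of how a node such as is_vertical_id_in_list might behave, the Python sketch below checks an order's vertical against a configured list and names the next node to visit. The field names, the two-way branching, and the numeric vertical IDs are assumptions made for this example; the actual node definition in the workflow platform is not described here.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Assumed vertical IDs, for illustration only.
RESTAURANT, GROCERY, ALCOHOL, PHARMACY = 1, 2, 3, 4


@dataclass(frozen=True)
class IsVerticalIdInList:
    """Hypothetical model of a node that branches on the order's business vertical."""
    vertical_ids: FrozenSet[int]  # verticals configured by the operator
    if_true: str                  # ID of the next node when the vertical is listed
    if_false: str                 # ID of the next node otherwise

    def next_node(self, order: dict) -> str:
        in_list = order["vertical_id"] in self.vertical_ids
        return self.if_true if in_list else self.if_false


node = IsVerticalIdInList(
    vertical_ids=frozenset({GROCERY, ALCOHOL, PHARMACY}),
    if_true="new_vertical_strategy",
    if_false="restaurant_strategy",
)

print(node.next_node({"vertical_id": GROCERY}))
```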
We had an experimentation platform at DoorDash, but experiments needed to be hard-coded by engineers. To save engineering time, we enhanced the workflow platform to configure an experiment without code. We added a new type of node, select_control_or_x_treatment (x is the number of treatment groups; see Figure 1), that allows operators to name an experiment. If a select_control_or_x_treatment node is configured as part of a decision tree, the workflow platform will leverage the APIs provided by the experimentation platform to execute the experiment and take the treatment or control path based on the results.
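The sketch below shows one way a node like select_control_or_x_treatment could pick a path. It is an assumption-heavy stand-in: the real node delegates assignment to the experimentation platform's APIs, whereas this example substitutes a deterministic hash of the experiment name and a customer ID so that it stays self-contained. The class name, fields, and path names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SelectControlOrXTreatment:
    """Hypothetical experiment node: paths[0] is control, the rest are treatments."""
    experiment_name: str
    paths: Tuple[str, ...]

    def choose_path(self, customer_id: str) -> str:
        # Stand-in for the experimentation platform call: hash the experiment
        # name together with the customer ID so assignment is stable per customer.
        key = f"{self.experiment_name}:{customer_id}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return self.paths[bucket % len(self.paths)]


experiment = SelectControlOrXTreatment(
    experiment_name="late_delivery_credit_amount",  # name supplied by the operator
    paths=("control", "treatment_1", "treatment_2"),
)

# The same customer always lands on the same path.
print(experiment.choose_path("customer-42") == experiment.choose_path("customer-42"))
```

Keeping the assignment inside the node means operators only name the experiment and wire up the paths; the bucketing logic stays out of the strategy configuration.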
Exploring the technical architecture
At this stage, we were ready to put everything together.
To orchestrate credits and refunds, the Python application routed traffic to the Kotlin application to determine the strategy. Control then returned to the Python application, which issued the credit or refund. The architecture behind the customer support platform, shown in Figure 2, highlights how the Python and Kotlin systems work together. This architecture transformed the way we tested and experimented with customer support resolution strategies.
After the redesigned system was rolled out, we saw significant improvement in how quickly our operators could respond to customer problems and define, test, experiment with, and roll out credit and refund resolution strategies. As the behavior of the system changed, new challenges cropped up, including a need for transparency about the configuration changes made to the system. We also needed more guardrails in the form of stricter validation of configuration data before it rolled into production. As our configuration-based decision-making system evolves, we are discovering new requirements, including a need for automated testing of resolution strategies to prevent production regressions caused by inaccurate strategies.
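The hand-off between the two systems can be sketched as follows. This is a simplified Python illustration under several assumptions: the Kotlin service is represented by a plain callable rather than a gRPC stub, and the Strategy fields, the ledger, and the credit cap are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Strategy:
    """Hypothetical result returned by the strategy service."""
    action: str        # "credit" or "refund"
    amount_cents: int


def handle_issue(
    order: dict,
    determine_strategy: Callable[[dict], Strategy],
    ledger: List[Tuple[str, str, int]],
) -> Strategy:
    # Step 1: the legacy application asks the new service which strategy applies.
    strategy = determine_strategy(order)
    # Step 2: control returns here, and the legacy application issues the payout.
    ledger.append((order["order_id"], strategy.action, strategy.amount_cents))
    return strategy


def fake_strategy_service(order: dict) -> Strategy:
    """Stand-in for the remote call; caps the credit at an assumed 500 cents."""
    return Strategy("credit", min(order["subtotal_cents"], 500))


ledger: List[Tuple[str, str, int]] = []
handle_issue({"order_id": "o-1", "subtotal_cents": 1200}, fake_strategy_service, ledger)
print(ledger)
```

Passing the strategy lookup in as a parameter keeps the decision logic replaceable while the issuing code stays where it was.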
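One form such a guardrail could take is a structural check that runs before a decision tree is published. The Python sketch below is illustrative only; it assumes a hypothetical configuration shape in which each node is a dictionary with an optional "children" list and leaves carry a "strategy" key.

```python
from typing import Dict, List


def validate_tree(nodes: Dict[str, dict], root: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the tree passed."""
    if root not in nodes:
        return [f"root '{root}' is not defined"]

    errors: List[str] = []
    visiting, done = set(), set()

    def visit(node_id: str) -> None:
        if node_id in done:
            return
        if node_id in visiting:
            errors.append(f"cycle through '{node_id}'")
            return
        visiting.add(node_id)
        node = nodes[node_id]
        children = node.get("children", [])
        if not children and "strategy" not in node:
            errors.append(f"leaf '{node_id}' has no strategy")
        for child in children:
            if child not in nodes:
                errors.append(f"'{node_id}' points to undefined node '{child}'")
            else:
                visit(child)
        visiting.discard(node_id)
        done.add(node_id)

    visit(root)
    # Nodes never reached from the root are likely leftovers from an edit.
    errors.extend(f"node '{n}' is unreachable" for n in sorted(set(nodes) - done))
    return errors


config = {
    "root": {"children": ["grocery", "restaurant"]},
    "grocery": {"strategy": "issue_credit"},
    "restaurant": {"strategy": "refund_item"},
}
print(validate_tree(config, "root"))
```

The same function could run in an automated test suite so that an invalid tree is rejected before it reaches production.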
Migrating systems from one technical stack to another is a complex endeavor. There is a natural temptation to redesign the system as part of the migration to eliminate tech debt and introduce best practices. When we did both at once, we kept an eye on maintaining functional feature parity between the two systems. After the technical migration was completed, we were able to verify that no regressions had been introduced. We then cut traffic over to the new system, allowing newer functional requirements to be applied only to the new system.
Thanks to Abin Varghese, Han Huang, Kumaril Dave, and Borui Zhang for their contributions to this effort.