Experimentation lets companies such as DoorDash trial new features among a limited group of users to gauge their success. For example, we might try showing personalized restaurant recommendations in a grid format as opposed to a list format on our app. If data shows that our experimental group of customers likes the grid format, we might roll that feature out to our entire user base. These experiments can also be performed in non-consumer-facing aspects of the application. For example, we experiment on different algorithms for assigning deliveries to Dashers and choose the best algorithm based on our internal metrics, such as delivery time (the time taken to complete a delivery) or number of completed deliveries.

In an effort to derive the highest quality data possible from these experiments, we developed Curie, our new experimentation analysis platform. Curie standardizes our experimentation analysis processes and results, and makes the data accessible to all stakeholders. 

We named Curie after the famous scientist Marie Curie, honoring her experiments in radioactivity. Built on a combination of SQL, Kubernetes, and Python for its main components, Curie is designed to standardize our experiment analysis processes, including A/B tests, Switchback tests, and Diff-in-Diff analysis. The platform helps us make data-driven decisions by validating the statistical significance (p-value) and treatment effects on metrics of interest.

Experimentation challenges before Curie

Prior to Curie, DoorDash’s data scientists analyzed their experiments in their local environments using SQL queries or ad hoc scripts. This process was time-consuming and error-prone. It also lacked standardization: everyone used different methods for analysis, which could affect the quality and accuracy of the results. Lacking a centralized platform, experiment results were scattered across documents and emails.

Given the different types of experiments we ran and the scale at which our data was growing, it was difficult to find a single tool that catered to all our needs. We also wanted to leverage open source tools and integrate with our data and internal tools, including ACLs and workflows.

To address the above problems, we decided to build our own in-house experimentation analysis platform to enjoy all the benefits of automated analysis. We designed the platform to improve the accuracy of the experiment results by using standard and scientifically sound methodologies proposed by our experimentation working group. The analysis platform would also provide a centralized environment to showcase experiment results and make it easy to share them across the company. Experiment results would be precomputed and readily available so that the data scientists need not wait for the completion of long-running queries. 

The experiment life cycle at DoorDash

Before getting into the details of Curie, let’s first understand how it is used during the life cycle of an experiment at DoorDash. As an analysis platform, Curie ingests data from experiments we have conducted and performs scientific analysis on the metrics, a process previously carried out by ad hoc scripts, as shown in Figure 1, below:

Diagram of DoorDash's experimentation methodology.
Figure 1: The life cycle of an experiment goes through multiple components of the experimentation platform, including configuration, instrumentation, and analysis.

The following sequence explains the life cycle of an A/B experiment randomized on users:

  1. The experimenter calculates the sample size required for an experiment by inputting their desired power and Minimal Detectable Effect (MDE). In this case, the sample size is the number of users, as we are experimenting on users. In this step, the experimenter also defines the allocation of users between the control group (the users who do not see the new feature) and the treatment group (the users who are exposed to the new feature) based on specific criteria, such as country or device type. Initially, we start with a small number of users in the treatment group and, depending on the preliminary results of the experiment, gradually increase the treatment allocation until all users are allocated to the treatment group. 
  2. The experimenter also sets up the experiment analysis configuration in the Curie web-based interface (WebUI). This configuration includes the list of metrics that need to be analyzed for this experiment. 
  3. When a user opens the DoorDash app, they will be randomly bucketed into either the control or treatment variation based on the allocation ratio specified in the first step. 
  4. This bucket assignment along with some contextual information (which we call experiment exposure events) are then logged in the data warehouse by our instrumentation infrastructure. 
  5. Curie performs the analysis using the exposures and metric data from the data warehouse and stores the results in the datastore.
  6. The analysis results will now be available on the Curie WebUI. The experiment will run for a period of time until we reach the required sample size. The experimenter can monitor the analysis continuously on Curie to confirm that the experiment does not have any negative effects on important metrics.
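As a concrete illustration of the sample-size step, here is a minimal sketch of a power calculation for a proportion metric, using the standard two-proportion z-test normal approximation. The function name and defaults are our own for illustration; Curie’s actual calculator may use a different method.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_size(baseline_rate: float, mde: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for detecting an absolute lift of `mde`
    over `baseline_rate` with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_power = NormalDist().inv_cdf(power)           # critical value for power
    p1, p2 = baseline_rate, baseline_rate + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)
```

For example, detecting a 2-point absolute lift on a 10% checkout conversion rate at 80% power requires a few thousand users per group; a larger MDE needs fewer users.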
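The bucketing step (step 3 above) can be sketched as a deterministic hash of the experiment name and user ID, so a given user always lands in the same variation for a given experiment. This is a common technique for randomization, not necessarily DoorDash’s exact implementation:

```python
import hashlib

def assign_bucket(experiment_name: str, user_id: str,
                  treatment_allocation: float = 0.5) -> str:
    """Deterministically bucket a user into control or treatment.

    Hashing the experiment name together with the user ID gives each
    experiment an independent, but repeatable, randomization of users.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a point in [0, 1].
    point = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if point < treatment_allocation else "control"
```

Because the assignment is a pure function of the inputs, increasing the treatment allocation later keeps existing treatment users in treatment while moving some control users over.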

Curie’s components

Let’s now zoom into Curie’s architecture. Curie is an end-to-end system, where multiple components such as WebUI, workers, stats engine, and metric definitions collectively function to analyze experiments and get the results back to the user, as shown in Figure 2, below:

Architecture of Curie experimentation platform
Figure 2: Curie analyzes all the experiments asynchronously by using a job queue and worker setup. This design enables us to analyze the experiments on both scheduled and on-demand basis.

Metric definitions

Curie provides maximum flexibility to data scientists, letting them define their own metrics. Data scientists use SQL query templates to define their metrics in the Curie repository, as shown below: 

WITH exposures AS (
  SELECT
    exp.BUCKET_KEY AS user_id,
    MIN(exp.RESULT) AS bucket
  FROM PRODUCTION.EXPERIMENT_EXPOSURE exp
  WHERE exp.EXPERIMENT_NAME = {{experiment_name}}
    AND exp.EXPOSURE_TIME::date BETWEEN {{start_date}} AND {{end_date}}
    AND exp.EXPERIMENT_VERSION = {{experiment_version}}
  GROUP BY 1
  HAVING COUNT(DISTINCT exp.RESULT) = 1
)

SELECT
  exp.*,
  metric1,
  metric2
FROM exposures exp
LEFT JOIN metric_table metrics
  ON metrics.user_id = exp.user_id
We dynamically generate the query using JinjaSQL templates by binding the SQL parameters with the values from the experiment configuration. The above snippet represents the structure of the SQL templates used for analysis. It fetches the experiment exposures, i.e., the bucket assigned to each user, and joins in the metrics for those users. 

As can be seen in the template, all the experiment details, including experiment name, experiment version, and experiment date range, are parameterized and will be substituted with the values from the Curie configuration. Parameterizing the experiment-specific details in the metric definitions allows data scientists to reuse a single SQL query for multiple experiments run by their team, as most teams monitor a similar set of metrics across all of their experiments.
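To illustrate the binding step without pulling in JinjaSQL itself, here is a stdlib-only sketch of substituting experiment configuration into a simplified, hypothetical metric template. In production, JinjaSQL produces proper bind parameters for the database driver rather than interpolating strings:

```python
from string import Template  # stdlib stand-in for JinjaSQL's templating

# Simplified version of the exposure query above; the experiment name
# and version come from the Curie analysis configuration.
SQL_TEMPLATE = Template("""
SELECT exp.BUCKET_KEY AS user_id, MIN(exp.RESULT) AS bucket
FROM PRODUCTION.EXPERIMENT_EXPOSURE exp
WHERE exp.EXPERIMENT_NAME = '$experiment_name'
  AND exp.EXPERIMENT_VERSION = $experiment_version
GROUP BY 1
""")

def render_metric_query(config: dict) -> str:
    # A real implementation would emit bind parameters instead of
    # interpolating values, which avoids SQL injection.
    return SQL_TEMPLATE.substitute(config)

query = render_metric_query({
    "experiment_name": "grid_vs_list_recs",  # hypothetical experiment
    "experiment_version": 2,
})
```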

We consider use of these templates as the first step in centralizing all the company’s important metric definitions. There is currently an ongoing effort to standardize these metric definitions and create a metrics repository that can allow data scientists to create and edit individual metrics and reuse them across different experiments and teams.

Curie workers

We have a set of workers, each a Kubernetes pod with appropriate resource constraints, that run the actual analysis for all the experiments. A cron job scheduled every morning triggers these workers by adding analysis tasks to the job queue (as shown in Figure 2, above). 

Once a worker receives a task, it fetches the experiment exposures and metrics data required for analysis from the data warehouse and performs the analysis using our Python stats engine. The results are then stored in a PostgreSQL data store with proper indexing for visualization in the Curie WebUI.

We also let users trigger an experiment analysis at any time, which turned out to be a very useful feature, as users did not want to wait for the cron schedule to validate their results. For example, if there was a bug in the SQL query used by the cron schedule, the user might want to fix the query and view the results immediately.
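The job-queue-and-worker pattern can be sketched with Python’s standard library; both the daily cron and an on-demand trigger simply enqueue the same kind of task. The experiment names and task shape here are illustrative, not Curie’s actual schema:

```python
import queue
import threading

def run_analysis(task: dict) -> str:
    # Placeholder for the real work: fetch exposures and metrics from the
    # warehouse, run the stats engine, and store results in PostgreSQL.
    return f"analyzed {task['experiment']}"

def worker(job_queue: queue.Queue, results: list) -> None:
    while True:
        task = job_queue.get()
        if task is None:            # sentinel tells the worker to exit
            job_queue.task_done()
            break
        results.append(run_analysis(task))
        job_queue.task_done()

job_queue, results = queue.Queue(), []
workers = [threading.Thread(target=worker, args=(job_queue, results))
           for _ in range(2)]
for w in workers:
    w.start()

# The daily cron (or an on-demand trigger) enqueues one task per experiment.
for experiment in ["grid_vs_list_recs", "assignment_algo_v2"]:  # hypothetical
    job_queue.put({"experiment": experiment})

for _ in workers:                   # one sentinel per worker
    job_queue.put(None)
job_queue.join()
for w in workers:
    w.join()
```

Decoupling producers from workers this way is what makes scheduled and on-demand analyses interchangeable: both are just tasks on the same queue.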

Python stats engine

We have an in-house Python stats library, developed by our data scientists, to analyze the metrics for experiments. This library analyzes different types of metrics: 

  • Continuous Metrics, which have a continuous numeric value, e.g., total delivery time.
  • Proportional Metrics with a binary (0/1) value, e.g., user checkout conversion, which says whether a user completed a checkout after being exposed to an experiment.
  • Ratio Metrics, which are ratios of two different continuous metrics, e.g., number of support tickets per delivery, where the numerator is the count of tickets and the denominator is the count of deliveries.  

Based on different factors, including metric type and sample size, the library applies different methodologies, such as linear models, bootstrapping, and the delta method, to compute the p-value and standard error. Clustering is very common in DoorDash’s experiments; for example, all deliveries from a particular store form a cluster. We use multiple methods to adjust the standard error and avoid false positives due to data clustering, such as the Cluster Robust Standard Error (CRSE) method in linear models, the delta method, and cluster bootstrapping. We selectively apply variance reduction methods to reduce noise in the results and improve the power of the experiments. The library also runs imbalance tests to statistically detect imbalances in the bucket sizes for A/B tests. 
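For a continuous metric with a large, non-clustered sample, the core computation reduces to a treatment-effect estimate, a standard error, and a p-value. Here is a minimal two-sample z-test sketch of that idea; the actual library additionally handles clustering, variance reduction, and the other metric types described above:

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def continuous_metric_test(control: list, treatment: list):
    """Two-sample z-test for a continuous metric (large-sample sketch).

    Returns the estimated treatment effect, its standard error, and a
    two-sided p-value under a normal approximation."""
    effect = mean(treatment) - mean(control)
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = effect / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return effect, se, p_value
```

Cluster-robust methods like CRSE exist precisely because this naive standard error understates uncertainty when observations within a cluster (e.g., deliveries from one store) are correlated.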

Exploring Curie’s WebUI

Architecture of Curie's web interface
Figure 3: We used a web-based interface for Curie, allowing data scientists and other users to access the platform from their browsers.

Curie’s user interface, built on React, is used to set up experiment analysis configuration and visualize analysis results. This interface is backed by a gRPC service and a Backend-For-Frontend (BFF) layer that interact with the datastore.

Conclusion

An experiment analysis platform is very important for automation and faster iteration on new features. It acts as the data scientist’s best friend, analyzing experiments for them so they can focus their efforts and time on other crucial aspects of experimentation. DoorDash data scientists are adopting Curie to improve their experimental velocity and more quickly determine which new features best serve our customers. Currently, we are working on converting this MVP into a more stable platform with features such as standard metrics, results visualization, and advanced statistical methodologies.

We believe our platform employs a modern architecture and technologies that make it very useful for our data scientists and extensible for the future. Curie may serve as an example for other companies building out an experimentation practice to improve their own apps and offerings.

Acknowledgements

Thanks to Yixin Tang and Caixia Huang for their contributions to this platform, and Sudhir Tonse, Brian Lu, Ezra Berger, and Wayne Cunningham for their constant feedback on this article.