For a data-driven company, it is key that every change is tested through experiments to ensure it has a measurable, positive impact on key performance metrics. At DoorDash, thousands of experiments run in parallel every month, and the results of each must be analyzed quickly and accurately to make this volume of testing possible. Running thousands of experiments is challenging because the increased complexity, more varied needs, and sheer scale can be hard to overcome.
To overcome these challenges and enable our product teams to operate faster at scale, we built Dash-AB, a centralized library for statistical analysis. This library, connected to our experimentation platform, Curie, enables us to test uniformly according to established best practices and to reuse complex statistical methods. This post takes a look under the hood at the challenges we faced with experimentation and the statistical methods we defined in order to build Dash-AB.
The challenges of having minimal experiment standardization
To support testing every feature via experiments before it can be launched, we needed to standardize our experimentation practices. Previously there was no standardization for experiment analysis at DoorDash, which made it difficult for analysis to keep up with the rate of experiments. Specifically, lack of standardization caused:
- Error-prone analysis results. For example, in the past many teams failed to account for clusters in diff-in-diff and switchback analyses, which leads to a high false-positive rate.
- Wasted effort from data scientists reinventing the wheel. Every team worked on its own implementation, even though the same problem had already been solved by someone else in the company.
- Slow learning. Some teams at DoorDash run more complex experiments and perform more advanced methodological research, but this gained expertise could not be easily shared with other experimenters.
We needed a way to enforce standardization and facilitate knowledge sharing to improve the velocity of our experiment analysis. Thus, to increase our experimentation capacity we developed a statistical engine to standardize experiment analysis and operate as a centralized place to incorporate best practices.
Building a centralized library for statistical analysis
To solve these standardization problems, we built a central library called Dash-AB, which serves as the statistical analysis engine that encapsulates all the methodologies used at DoorDash and powers experiment analysis calculations across different use cases.
Building a central statistical analysis engine is challenging because of the wide variety of use cases across DoorDash's product teams (e.g., experiments on the logistics side are very different from experiments on the consumer side). The library needs to satisfy the following requirements:
- Trustworthiness is the highest priority: We need to guarantee the quality of any experiment result, which should be reasonably accurate (consistency), robust to model misspecification (robustness), and sensitive enough to detect business or product changes (power).
- Accessible to teams: We want users to understand quickly and efficiently how to use the library. Concretely, it should expose clear inputs and outputs, and results should be interpretable and intuitive rather than ambiguous.
- Able to scale with the fast-growing experiment volume: Similar to other engineering systems, which support the hyper-growth of DoorDash’s business, Dash-AB always needs to be able to scale to meet the volume of experiments. The library should be able to deliver results in a timely manner even with larger inputs.
- Provide complete coverage of the problem domains at DoorDash: The library needs to handle different use cases for different product teams. It includes methodologies for different types of experiments and metrics. It also includes more cutting-edge features to improve the experimentation velocity.
Building a simple user interface
The first step in building Dash-AB was to create a user interface that even people unfamiliar with statistics could easily use. Experimenters only need to provide Dash-AB with a JSON config of experiment and metric information, as shown below, and the library does all the complex calculations. In this example, the experiment randomizes on consumer_id at a 50/50 split. With this input, Dash-AB will calculate the results for the metric "number_deliveries", and the covariate "number_deliveries_cov" will be added for variance reduction.
{
    "columns": {
        "number_deliveries": {
            "column_type": "metric",
            "metric_type": "ratio",
            "numerator_column": "total_deliveries",
            "denominator_column": "total_consumers"
        },
        "number_deliveries_cov": {
            "column_type": "covariate",
            "value_type": "ratio",
            "numerator_column": "total_deliveries_cov",
            "denominator_column": "total_consumers_cov"
        },
        "bucket": {
            "column_type": "experiment_group",
            "control_label": "control",
            "variation": ["control", "treatment"],
            "variations_split": [0.5, 0.5]
        }
    },
    "experiment_settings": {
        "type": "ab"
    }
}
Providing different methods to support experiment use cases
Now that we have seen how the user interface looks, we can talk about what is inside Dash-AB that makes it useful for all our different experimentation use cases. Figure 1 illustrates how data flows through Dash-AB; we will go through the methodologies available at each step of the process. Essentially, whenever a new experiment is put into our experimentation platform, it goes through a pipeline of:
- Validation and preprocessing
- Variance calculation and variance reduction
- Hypothesis testing
- Sequential testing vs. fixed horizon test
Since each of these steps can have a number of different approaches, the next sections will go through each part of the process and explain what methodology options are available.
Data validation / preprocessing
Before analysis starts, Dash-AB runs a few validation checks to ensure the quality of data, including:
- Imbalance test to check if the bucket ratio is as configured. A sample ratio mismatch could cause bias in the experiment results. Thus it’s important to run this check to ensure the validity of results from the beginning. By default, a chi-square test is used to check if there is a mismatch.
- Flicker test to check if there are any entities that are bucketed to both the treatment and control groups. If the flicker ratio is high, this issue could also cause bias in the results.
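As an illustration, for a two-bucket experiment the imbalance check reduces to a chi-square goodness-of-fit test on the observed bucket counts. The sketch below is a minimal stand-alone version, not Dash-AB's actual API; the function name and thresholds are ours:

```python
import math

def srm_check(control_n, treatment_n, expected_split=(0.5, 0.5)):
    """Chi-square test (1 degree of freedom) for sample ratio mismatch."""
    total = control_n + treatment_n
    expected = [p * total for p in expected_split]
    stat = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((control_n, treatment_n), expected))
    # Survival function of a chi-square with 1 d.o.f.: P(X > stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2))

# Small random noise in bucket sizes is expected; a large skew is not.
assert srm_check(50_000, 50_200) > 0.01    # no mismatch detected
assert srm_check(50_000, 53_000) < 0.001   # sample ratio mismatch: investigate
```

A p-value far below a preset threshold signals that the observed split deviates from the configured one, and the experiment's bucketing should be investigated before any results are trusted.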
Variance calculation
After validating the quality of data, Dash-AB starts the calculation of variance, which is the most important step of the whole pipeline. The extensive methodologies in Dash-AB for calculating variance make the support for different experiment use cases possible. These use cases include:
- different types of metrics, such as ratio, discrete, or continuous
- different types of metric aggregation functions, such as average treatment effect and quantile treatment effect
- variance reduction
- analysis of interaction effect between multiple experiments or features
There are three types of techniques used to calculate the variance in Dash-AB. Based on the config input from the users, the library chooses the method, but users can also override the default and choose the desired method for their analysis:
- Regression-based method: This is the most straightforward approach in terms of implementation, given the many external libraries available, and it is applicable to many complex use cases:
- Firstly, adding covariates for variance reduction is very easy to achieve in a regression.
- Secondly, with Cluster Robust Standard Error, this model can handle clustered data, which is very common in our logistics experiments.
- Lastly, it makes calculating interaction effects easy, simply by adding interaction terms to the regression.
Because of these benefits, regression was widely used in the early stages of Dash-AB. However, the downsides of the regression-based method soon surfaced:
- Regression comes with high memory costs and high latency. At DoorDash, our experiments usually involve large quantities of data and it’s very easy to run into long latency and out of memory issues.
- It also doesn’t work for ratio metrics where there is only one data point for each bucket.
- Delta-method based: The delta method allows us to extend asymptotic normality to any continuous transformation, so we can calculate the variance of a ratio metric or a quantile treatment effect analytically. Because we no longer need the complex matrix operations that regression-based methods require, adopting the delta method significantly reduces calculation latency and memory usage.
- Bootstrapping: Dash-AB also offers bootstrap-based methods for use cases that are not otherwise covered, for example when the data size is too small. Bootstrap-SE and Bootstrap-t are the two main functions provided in Dash-AB.
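To make the delta-method idea concrete, the hypothetical helper below computes the first-order Taylor (delta-method) variance of a ratio metric mean(x) / mean(y) from paired per-unit values; it is a sketch for illustration, not Dash-AB's implementation:

```python
def ratio_variance_delta(x, y):
    """Delta-method variance of the ratio mean(x) / mean(y) over n paired units."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    r = mx / my
    var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - my) ** 2 for yi in y) / (n - 1)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    # First-order expansion of mean(x)/mean(y) around the true means:
    # Var(r) ~ (var_x - 2*r*cov_xy + r^2*var_y) / (n * my^2)
    var_r = (var_x - 2 * r * cov_xy + r ** 2 * var_y) / (n * my ** 2)
    return r, var_r

# Perfectly proportional numerator and denominator make the ratio constant,
# so the delta-method variance collapses to (numerically) zero.
r, var_r = ratio_variance_delta([2, 4, 6, 8], [1, 2, 3, 4])
assert abs(r - 2.0) < 1e-9 and abs(var_r) < 1e-9
```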
Variance reduction
Variance reduction is a procedure used to increase the power or sensitivity of the experiment so that experiments can conclude faster. In Dash-AB, the control covariates method is used most commonly for variance reduction purposes. There are two main types of variance reduction methods that are used at DoorDash today:
- CUPED: uses pre-experimental average metric values as covariates.
- CUPAC: uses ML predictions as covariates.
The adoption of these two variance reduction methods at DoorDash generally helps us reduce the sample size needed for statistical significance by 10% to 20%.
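The CUPED adjustment itself is only a few lines: choose theta to minimize the variance of the adjusted metric, then subtract theta times the centered pre-period covariate. Below is a minimal sketch on simulated data (the data and helper names are illustrative, not Dash-AB code):

```python
import random

random.seed(7)

# Simulated per-user data: the experiment-period metric y is correlated with
# the pre-experiment metric x, which serves as the CUPED covariate.
x = [random.gauss(10, 2) for _ in range(5000)]
y = [0.8 * xi + random.gauss(0, 1) for xi in x]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# CUPED: subtract theta * (x - mean(x)); theta = cov(y, x) / var(x) minimizes
# the variance of the adjusted metric without changing its mean.
theta = cov(y, x) / var(x)
mx = mean(x)
y_cuped = [yi - theta * (xi - mx) for yi, xi in zip(y, x)]

assert abs(mean(y_cuped) - mean(y)) < 1e-6   # adjustment is mean-preserving
assert var(y_cuped) < 0.5 * var(y)           # large variance reduction here
```

The same machinery gives CUPAC when the covariate x is replaced by an ML model's prediction of the metric from pre-experiment features.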
Sequential testing vs. fixed horizon test
After all the standard error and treatment effect calculations are finished, Dash-AB goes to the hypothesis testing phase, where it calculates statistics like the p-value and confidence interval. Today, Dash-AB offers two types of tests for different use cases:
- Fixed horizon test: This is a regular t-test measuring the confidence interval and p-value, and it is the most commonly used test for randomized experiments. One downside is that the length of the experiment must be decided before it starts. To set the duration, we need to estimate the minimum detectable effect (MDE), which can be very difficult: durations that are too long reduce development velocity, while durations that are too short undermine the power of the test. Another downside is the peeking issue. Experimenters are not supposed to read the results before the planned end date, because peeking early inflates the false positive rate. In practice, however, this restraint is very hard to maintain, as teams typically monitor results closely for any unexpected effects.
- Sequential testing: To solve the peeking issue and speed up the experimentation process, we developed sequential testing in Dash-AB, which uses an mSPRT (mixture sequential probability ratio test) to calculate an always-valid p-value and confidence interval. Because sequential testing guarantees always-valid results, experimenters are free to look at the results at any time during the experiment.
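The core of an mSPRT with a normal mixing distribution fits in a few lines. The sketch below assumes a known observation variance and a stream of treatment-minus-control differences; it is a textbook formulation for illustration, not necessarily the exact variant inside Dash-AB:

```python
import math

def msprt_p_values(diffs, sigma2, tau2=1.0):
    """Always-valid p-values from a mixture SPRT: the treatment effect under H1
    is mixed over N(0, tau2); H0 is a zero effect with known variance sigma2."""
    p, total, n, out = 1.0, 0.0, 0, []
    for d in diffs:
        total += d
        n += 1
        mean_diff = total / n
        v_n = sigma2 / n  # variance of the running mean difference
        # Mixture likelihood ratio of H1 vs. H0 after n observations.
        lam = math.sqrt(v_n / (v_n + tau2)) * math.exp(
            tau2 * mean_diff ** 2 / (2 * v_n * (v_n + tau2)))
        p = min(p, 1.0 / lam)  # running minimum keeps the p-value always valid
        out.append(p)
    return out

# With a real effect the p-value drops as evidence accumulates;
# with no effect it never drops at all, no matter how often we peek.
assert msprt_p_values([1.0] * 100, sigma2=1.0)[-1] < 1e-6
assert msprt_p_values([0.0] * 100, sigma2=1.0)[-1] == 1.0
```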
Sample size calculation
As previously discussed, a sample size calculation is needed to decide the end date of an experiment. Dash-AB provides a pipeline to support this calculation; it shares the same variance calculation component and a similar interface with the A/B pipeline. A sample size calculator UI was also built into our platform to support the calculation.
{
    "columns": {
        "number_deliveries": {
            "column_type": "metric",
            "metric_type": "ratio",
            "numerator_column": "total_deliveries",
            "denominator_column": "total_consumers",
            "absolute_treatment_effect": 0.1,
            "power": 0.8
        },
        "number_deliveries_cov": {
            "column_type": "covariate",
            "value_type": "ratio",
            "numerator_column": "total_deliveries_cov",
            "denominator_column": "total_consumers_cov"
        }
    },
    "experiment_settings": {
        "type": "ab"
    }
}
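Behind a config like the one above, the required sample size under the usual normal approximation follows directly from the significance level, power, metric variance, and the absolute treatment effect (MDE). Here is a sketch of the standard formula (the function name and defaults are ours, not Dash-AB's):

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample test:
    n = 2 * sigma^2 * (z_{1-alpha/2} + z_{power})^2 / mde^2"""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma ** 2 * (z_alpha + z_power) ** 2 / mde ** 2)

# Detecting a 0.1 absolute lift on a unit-variance metric at 80% power
# needs roughly 1,570 subjects per group.
n = sample_size_per_group(sigma=1.0, mde=0.1)
```

Variance reduction feeds directly into this formula: a smaller effective sigma (e.g., from CUPED or CUPAC) shrinks the required sample size quadratically.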
Diff-in-Diff Analysis
When experimenting at a regional level, diff-in-diff analysis is sometimes used, given that we have a very limited number of regions available for randomized testing. Dash-AB provides a diff-in-diff pipeline that handles the matching process as well as the analysis of p-values and confidence intervals.
{
    "columns": {
        "number_deliveries": {
            "column_type": "metric",
            "metric_type": "ratio",
            "numerator_column": "total_deliveries",
            "denominator_column": "total_consumers"
        },
        "submarket_id": {
            "column_type": "experiment_randomize_unit"
        },
        "date": {
            "column_type": "date"
        }
    },
    "experiment_settings": {
        "type": "diff_in_diff",
        "treatment_unit_ids": [1, 2, 3],
        "match_unit_size": 5,
        "matching_method": "correlation",
        "matching_start_date": "2021-01-01",
        "matching_end_date": "2021-02-01",
        "experiment_start_date": "2021-03-01",
        "experiment_end_date": "2021-06-01",
        "matching_columns": ["number_deliveries"],
        "matching_weights": [1]
    }
}
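The "correlation" matching method in the config above can be sketched as ranking candidate control units by the Pearson correlation of their pre-period metric series against the treated unit's series, then keeping the best matches. The helpers and unit IDs below are made up for illustration:

```python
def pearson(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def match_control_units(treated_series, candidates, k):
    """Keep the k candidate control units whose pre-period metric series
    correlate most strongly with the treated unit's series."""
    ranked = sorted(candidates.items(),
                    key=lambda item: pearson(treated_series, item[1]),
                    reverse=True)
    return [unit_id for unit_id, _ in ranked[:k]]

treated = [10, 12, 11, 13, 15, 14]
candidates = {
    "sm_1": [20, 24, 22, 26, 30, 28],  # moves in lockstep with the treated unit
    "sm_2": [5, 6, 5, 6, 5, 6],        # weakly related
    "sm_3": [15, 11, 14, 10, 9, 12],   # moves in the opposite direction
}
assert match_control_units(treated, candidates, 2) == ["sm_1", "sm_2"]
```

The matched control units then provide the counterfactual trend against which the treated units' post-launch change is compared.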
How Dash-AB empowers statistical calculation behind our centralized experimentation platform
Besides being adopted by data scientists to run analyses locally, Dash-AB is also the central component of Curie, DoorDash's experimentation platform, which provides an abstracted UI for setting up experiments and measuring their results. Curie sends the data and configuration to Dash-AB, which handles the statistical analysis and returns the results.
Conclusion
In a data-driven world where DoorDash runs massive numbers of experiments, we want to empower all teams to make improvements with speed, rigor, and confidence. To achieve this, we invested in building a statistical engine library, Dash-AB, which standardizes core experimentation frameworks such as A/B and switchback testing as well as advanced techniques such as CUPED, CUPAC, and interaction-effect analysis. It also powers our analysis platform, Curie, automating work and reducing the time people need to spend on analysis.
For other companies working toward increasing their development velocity while remaining data-driven, an experimentation engine like Dash-AB will go a long way toward speeding up experiments. This post should be a useful guide to thinking through the development challenges, as well as the different methodologies experiments need in order to be trustworthy and efficient.
Acknowledgement
Without a doubt, building such a statistics library to support the experimentation platform for a diverse set of use cases requires more than just a team; it requires a whole village. Firstly, we would like to acknowledge all other founding members of Dash-AB: Kevin Teng, Sifeng Lin, Mengjiao Zhang and Jessica Zhang. Secondly, we would like to thank the close partners from the data science community at DoorDash: Joe Harkman, Abhi Ramachandran, Qiyun Pan and Stas Sajin. And lastly, we are proud of and appreciate the other Experimentation Platform team members who have been making continuous development and maintenance with customer obsession - Arun Kumar Balasubramani, Tim Knapik, Natasha Ong and Michael Zhou. We also appreciate the continuous support from leadership: Bhawana Goel, Jessica Lachs, Alok Gupta, Sudhir Tonse. Thanks to Ezra Berger for helping us write this article.
Special thanks to Professor Navdeep S. Sahni for providing guidance and counseling on the development of the framework and experimentation science at DoorDash.