At DoorDash, we rely on experimentation to make decisions regarding model improvements and product changes because we cannot perfectly predict the results in advance. But experiments conducted by the Dasher team aimed at improving delivery quality and Dasher experience face challenges that arise from conflicting goals, including minimizing interference effects while simultaneously capturing learning effects and maximizing power.
Challenges in marketplace experimentation
In short, there is no one-size-fits-all solution for designing experiments that meets every demand for results. Our experiments walk a fine line between three key goals:
- Reduce interference (network) effects. The fact that DoorDash represents a three-sided marketplace introduces complex dependencies. In an order-level A/B test, treatment and control orders that occur simultaneously in the same geographic area will not be independent because they rely on the availability of a shared Dasher fleet. One way to solve this problem is to design a switchback experiment in which we randomly assign treatment vs. control to region/time window pairs instead of deliveries. More details about network effects and switchback can be found in this blog post.
- Capture learning effects. While switchback may solve the network effects problem, it cannot measure Dasher learning or other long-term effects such as Dasher retention because it cannot provide a consistent treatment experience for Dashers. Even if we can sometimes detect an effect with relatively high power in a short time period, we may need to run a longer experiment because it takes time – possibly several weeks – for Dashers to adjust their behaviors in response to a treatment.
- Maximize power. If we try to design experiments that capture learning effects and minimize network effects, we would end up with lower power tests that may struggle to detect changes in key metrics. For example, we could run a market-level A/B experiment to overcome network effects and capture learning effects, but it would have comparatively low power because the small number of markets relative to e.g. the number of Dashers limits our effective sample size. Alternatively, we could run two separate experiments – say, a switchback for network effects and a Dasher A/B for learning effects. While these two experiment types have medium and high power, respectively, running two separate experiments adds complexity and could extend the time required to obtain meaningful results.
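To make the switchback idea concrete, here is a minimal sketch of assigning region/time window pairs, rather than individual deliveries, to treatment or control. The function name, the 30-minute window, and the `salt` value are hypothetical choices for illustration and are not taken from DoorDash's actual system.

```python
import hashlib
from datetime import datetime, timezone

def switchback_arm(region_id: str, ts: datetime, window_minutes: int = 30,
                   salt: str = "exp-001") -> str:
    """Assign a (region, time window) unit to an arm deterministically.

    Every delivery in the same region and time window gets the same arm,
    so the unit of randomization is the region/window pair, not the delivery.
    """
    # Index of the time window that the timestamp falls into.
    window_index = int(ts.timestamp() // (window_minutes * 60))
    key = f"{salt}:{region_id}:{window_index}".encode()
    # Hash the unit key; the parity of the first byte gives a pseudo-random 50/50 split.
    digest = hashlib.sha256(key).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"

# Two deliveries one minute apart in the same region share an arm.
t0 = datetime.fromtimestamp(1_700_000_000, tz=timezone.utc)
t1 = datetime.fromtimestamp(1_700_000_060, tz=timezone.utc)
print(switchback_arm("sf", t0) == switchback_arm("sf", t1))
```

Hashing the unit key instead of drawing a random number keeps the assignment reproducible without storing a lookup table, which is one possible design choice among several.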
The randomization trilemma
In most of our experiments, we are interested in multiple business metrics around Dasher experience and delivery quality, including such things as delivery duration and lateness, which involve both network effects and learning effects. Therefore, we may need to design each experiment differently based on the characteristics of different randomization units. The most common three randomization units that our team uses are switchback, Dasher A/B, and market A/B. The trilemma is outlined in Figure 1.
| Randomization unit | Advantages | Disadvantages |
|---|---|---|
| Switchback | Moderate power; minimal network effects | Cannot capture learning effects |
| Dasher A/B | High power; can capture learning effects | Severe network effects |
| Market A/B | Little to no network effects; can capture learning effects | Low power |
Impact of randomization on power
Different randomization methods produce different power because deliveries in the same switchback units, Dashers, or markets are not independent. Consequently, we use cluster robust standard errors, or CRSE, to prevent potential false positive results. More details can be found in this blog post. As a result, the experimental power is not determined by the number of deliveries but instead by the number of clusters, which are the number of switchback units, Dashers, or markets in the three randomization methods described above. Various switchback units, Dashers, or markets may have different numbers of deliveries, and this volume imbalance can further reduce the effective power.
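The effect of clustering and volume imbalance on power can be illustrated with a design-effect calculation. The sketch below uses a standard textbook approximation for unequal cluster sizes, not necessarily the calculation our team uses, and the intra-cluster correlation `icc` is an assumed input that would have to be estimated from historical data.

```python
from statistics import mean, pvariance

def design_effect(cluster_sizes, icc):
    """Approximate variance inflation from clustering with unequal cluster sizes.

    DEFF = 1 + ((1 + cv^2) * m_bar - 1) * icc, where m_bar is the mean cluster
    size and cv is the coefficient of variation of the cluster sizes.
    """
    m_bar = mean(cluster_sizes)
    cv_sq = pvariance(cluster_sizes) / m_bar ** 2
    return 1 + ((1 + cv_sq) * m_bar - 1) * icc

def effective_sample_size(cluster_sizes, icc):
    """Number of independent observations the clustered sample is worth."""
    return sum(cluster_sizes) / design_effect(cluster_sizes, icc)

balanced = [10] * 100                 # 100 clusters with 10 deliveries each
imbalanced = [1] * 50 + [19] * 50     # same total volume, uneven clusters
print(effective_sample_size(balanced, 0.1))
print(effective_sample_size(imbalanced, 0.1))
```

With perfectly correlated deliveries (`icc = 1`) and balanced clusters, the effective sample size collapses to the number of clusters, which matches the intuition that power is driven by clusters rather than deliveries.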
Note that we can increase the power of a switchback experiment by reducing the time window or region size, but these reductions come at the cost of potential increases in network effects. Switchback experiments are not completely free from network effects; these effects can still exist around the geographical and temporal borders between clusters. The smaller the switchback units, the more severe the network effects will be, because a greater percentage of deliveries may be near time-window or geographical boundaries where network effects may still be present.
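A toy calculation shows the shape of this trade-off along the time dimension. It assumes, purely for illustration, that deliveries created within a fixed `carryover_minutes` after a window boundary may still be affected by the previous window's arm and that deliveries arrive uniformly over time; both the model and the numbers are hypothetical.

```python
def units_per_day(window_minutes):
    """Number of switchback time windows per region per day."""
    return (24 * 60) // window_minutes

def contaminated_share(window_minutes, carryover_minutes):
    """Share of a window that sits within the carryover period after a boundary."""
    return min(1.0, carryover_minutes / window_minutes)

for window in (120, 60, 30, 15):
    print(window, units_per_day(window), round(contaminated_share(window, 10), 2))
```

Halving the window doubles the number of randomization units but also doubles the share of deliveries near a boundary, until the whole window falls inside the carryover period.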
It is also important to note that because of the limited number of markets, a simple randomization for market A/B may lead to pre-experiment bias between treatment and control for some metrics. As a result, we usually need to use stratification and diff-in-diff methods to better account for potential pre-experiment bias and ensure that the parallel trend assumption holds; synthetic control will be necessary if this assumption is violated.
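As a minimal sketch of the diff-in-diff idea, the estimate subtracts the control markets' pre-to-post change from the treatment markets' pre-to-post change, so that a constant pre-experiment gap between the groups cancels out. The function name and the numbers are made up for illustration, and the estimate is only meaningful if the parallel trend assumption holds.

```python
from statistics import mean

def diff_in_diff(pre_treat, post_treat, pre_ctrl, post_ctrl):
    """Difference-in-differences on market-level metric values.

    Each argument is a list of per-market averages for one group and one period.
    """
    treat_change = mean(post_treat) - mean(pre_treat)
    ctrl_change = mean(post_ctrl) - mean(pre_ctrl)
    return treat_change - ctrl_change

# Hypothetical delivery durations in minutes; treatment markets start out slower.
effect = diff_in_diff(
    pre_treat=[30.0, 32.0], post_treat=[29.0, 31.0],
    pre_ctrl=[28.0, 30.0], post_ctrl=[28.5, 30.5],
)
print(effect)  # -1.5
```

A simple post-period comparison of these made-up numbers would show treatment markets as slower than control (30.0 vs. 29.5 minutes), even though the treatment group improved relative to its own baseline.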
Creating the optimal experiment design
In order to maximize use of our experiment bandwidth, we dynamically design experiments based on the expected trade-offs between network effects, learning effects, and power for each intervention. We may choose to run a single test or multiple test types in parallel to triangulate the true impact while still maintaining power.
Trade-offs between power and network effects
Not all of our experiments affect all of the orders in the system; some affect only a subset of total orders. For instance, an experiment that only impacts flower delivery may directly affect only 1% of total volume. Even if we randomize such a test at the order level, the network effects could be so small that we can ignore them because it’s unlikely these treatment and control deliveries will compete for the same group of Dashers. In this situation, we may opt not to run a switchback experiment, instead running a Dasher A/B or other A/B experiment to significantly increase power.
When we expect only strong network effects or learning effects
In this easy scenario, we can run a switchback experiment to mitigate network effects if we want to test a new delivery duration estimation model, or we can run a Dasher A/B experiment to address Dasher learning effects with high power if we want to test a Dasher app product change. We can follow the regular power analysis to determine the minimum experiment duration – usually rounded up to whole weeks to avoid day-of-week effects – or calculate the minimum detectable effect, or MDE, given the minimum experiment duration that a metric's definition requires, as is the case for Dasher retention. If there is no conflict with other experiments, we can run it in all markets to maximize the power.
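The two directions of the power analysis can be sketched with the usual normal approximation for a two-arm comparison of means. The function names, the default `alpha` and `power`, and the example inputs are assumptions for illustration; in a clustered design, `sigma` and the unit counts would refer to cluster-level values (switchback units or Dashers), not individual deliveries.

```python
import math
from statistics import NormalDist

def _z_sum(alpha, power):
    # Critical value for a two-sided test plus the quantile for the target power.
    return NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)

def minimum_detectable_effect(sigma, n_per_arm, alpha=0.05, power=0.8):
    """Smallest true difference in means detectable with the given sample size."""
    return _z_sum(alpha, power) * sigma * math.sqrt(2 / n_per_arm)

def weeks_needed(sigma, target_effect, units_per_arm_per_week, alpha=0.05, power=0.8):
    """Minimum duration, rounded up to whole weeks to avoid day-of-week effects."""
    n_required = 2 * (_z_sum(alpha, power) * sigma / target_effect) ** 2
    return math.ceil(n_required / units_per_arm_per_week)

print(weeks_needed(sigma=1.0, target_effect=0.1, units_per_arm_per_week=500))  # 4
print(minimum_detectable_effect(sigma=1.0, n_per_arm=2000))
```

When a metric such as retention fixes the duration in advance, the second function is the relevant one: plug in the units that will accumulate over that duration and check whether the resulting MDE is smaller than the effect we expect.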
When both strong network effects and learning effects are expected
This is the most common scenario that we face. In general, we could run a market A/B diff-in-diff experiment or two separate experiments – one switchback and one Dasher A/B – separated either by time or by market.
A single market A/B diff-in-diff
A market A/B experiment is appropriate when it’s reasonable to expect that the power will be sufficiently high – in other words, when the expected effects from the treatment are significantly larger than the calculated MDE for the test. Running this single experiment avoids conducting two separate experiments, which adds complexity and could delay learning and decision making. However, because of the small number of markets, simple randomization may not give us comparable separation between treatment and control; therefore, we tend to use stratified randomization to split the markets to maximize power. For example, if we have two metrics – X1 and X2 – we may split all markets into, say, nine strata (high/medium/low based on each metric), and then randomly assign an equal number of markets in each stratum to treatment and control.
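The nine-strata example can be sketched as follows. This is one possible implementation under stated assumptions: strata are formed from terciles of two pre-experiment metrics, markets are split within each stratum as evenly as possible, and the function names and input format are hypothetical.

```python
import random
from collections import defaultdict

def tercile(value, values):
    """Return 0 (low), 1 (medium), or 2 (high) based on the rank of `value`."""
    rank = sum(v < value for v in values)
    return 3 * rank // len(values)

def stratified_split(markets, seed=0):
    """markets maps market_id -> (x1, x2); returns market_id -> arm."""
    rng = random.Random(seed)
    x1_values = [x1 for x1, _ in markets.values()]
    x2_values = [x2 for _, x2 in markets.values()]
    # Group markets into up to nine strata: (x1 tercile, x2 tercile).
    strata = defaultdict(list)
    for market_id, (x1, x2) in markets.items():
        strata[(tercile(x1, x1_values), tercile(x2, x2_values))].append(market_id)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        # Random starting arm so an odd-sized stratum does not always favor one arm.
        offset = rng.randrange(2)
        for i, market_id in enumerate(members):
            assignment[market_id] = "treatment" if (i + offset) % 2 == 0 else "control"
    return assignment

# Hypothetical markets with two made-up pre-experiment metrics.
markets = {f"market_{i}": (i, (i * 7) % 36) for i in range(36)}
assignment = stratified_split(markets, seed=42)
print(sum(arm == "treatment" for arm in assignment.values()))
```

Within every stratum the two arms differ by at most one market, so treatment and control end up with similar distributions of both metrics even when the total number of markets is small.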
Two separate experiments
Market A/B, however, usually does not give us enough power. Instead, we have to conduct two different experiments: a Dasher A/B to measure Dasher retention and a switchback to measure delivery quality metrics. Because the two experiments cannot run simultaneously in the same markets, we have to allocate our limited experiment bandwidth either sequentially or geographically as illustrated in Figures 2 and 3.
1) Geographical separation
One way to run two experiments in parallel would be to allocate traffic geographically between a switchback and a Dasher A/B experiment. In such a geographical split, we can efficiently divide the markets between the two experiments in a way – for instance, 70/30 – that requires roughly the same number of weeks based on power analysis (see Figure 2a). An inefficient market split would require one experiment to run for four weeks while the other one needs to run for eight weeks, as illustrated in Figure 2b. Such an inefficient design delays the final decision-making for this experiment. Regardless of the optimal split ratio, we always recommend splitting the markets through stratified randomization to reduce any potential noise introduced by the split.
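Choosing the split ratio can be framed as a small search: for each candidate share of markets, compute how many weeks each experiment would need and keep the split that minimizes the longer of the two. In the sketch below, the two duration functions stand in for a real power analysis, and the toy model (duration inversely proportional to market share, with made-up full-traffic durations) is an assumption for illustration only.

```python
import math

def best_split(sb_weeks_fn, ab_weeks_fn, step_pct=5):
    """Search market shares for the switchback; the Dasher A/B gets the rest.

    Returns (weeks_until_both_finish, switchback_pct, sb_weeks, ab_weeks).
    """
    candidates = []
    for sb_pct in range(step_pct, 100, step_pct):
        sb_weeks = sb_weeks_fn(sb_pct / 100)
        ab_weeks = ab_weeks_fn(1 - sb_pct / 100)
        candidates.append((max(sb_weeks, ab_weeks), sb_pct, sb_weeks, ab_weeks))
    # Tuples compare by total duration first, so min() picks the fastest split.
    return min(candidates)

# Toy duration models: weeks needed if the experiment had all markets, scaled by share.
def sb_weeks(share):
    return math.ceil(2.5 / share)

def ab_weeks(share):
    return math.ceil(1.0 / share)

total, sb_pct, sb_w, ab_w = best_split(sb_weeks, ab_weeks)
print(total, sb_pct, sb_w, ab_w)
print(max(sb_weeks(0.5), ab_weeks(0.5)))  # duration of a naive 50/50 split
```

In practice the duration functions would come from the power analysis for each experiment type, and the chosen share would then be realized through stratified randomization of markets.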
2) Sequential separation
We often prefer a sequential allocation over a geographical allocation so that we can fail fast and iterate (see Figure 3b). This is particularly useful if we believe that aspects of the intervention require tuning before we commit to a longer-term learning test where impacts, by definition, will take multiple weeks to observe.
Consider an experiment where we want to measure both Dasher retention and delivery quality, but market A/B does not have enough power. We can run a switchback experiment first to measure quality effects and then run a Dasher A/B experiment to measure retention effects. If quality effects do not meet our expectations, we can quickly iterate and improve the treatment design to run another switchback experiment without having to wait for a long-term Dasher A/B experiment to complete. Once we are satisfied with the quality metrics and have finalized the treatment design, we can start the long-term Dasher A/B experiment to measure retention.
In DoorDash’s data-driven culture, experimentation drives how and when to ship product changes and iterate on model improvements. As a result, we need to develop rapid, high-power testing procedures so that we can quickly learn and make decisions. Because of the complex interactions between consumers, Dashers, and merchants on the DoorDash marketplace and platform, however, most of our interventions involve trade-offs between network effects, learning effects, and power. No single experimental framework can serve for all of the various interventions that we want to test. We therefore came up with a set of experiment design solutions that can be used under a variety of scenarios. By applying these methods, using variance reduction techniques, and running multiple experiments simultaneously, we are able to significantly improve our experimentation speed, power, and accuracy, allowing us to ship more product changes and model improvements confidently. These approaches are practical, easy to implement, and applicable to a wide variety of marketplace use cases.