DoorDash uses machine learning to determine where best to spend its advertising dollars, but a rapidly changing market combined with frequent delays in data collection hampered our optimization efforts. Our new attribution forecast model lets us predict the efficacy of ad campaigns from their initial data, helping make critical business decisions earlier.

Usually, we need to wait some time to measure ad performance due to our attribution methodology. This slow reaction time means that if one ad channel performs particularly well, we cannot move our marketing dollars to that channel quickly. 

Our new attribution forecast model predicts an ad’s final conversion volume after observing only its initial data. This allows us to utilize more recent data, optimizing our conversions per ad spend by scaling the best channels as performance changes over time.

Beyond routine optimization, this forecasting framework is especially useful during marketing experiments, where identifying the winning ad sooner accelerates impact. It could also be extended to other delayed-information problems, such as ticket sales for concerts, hotel bookings, or holiday flower sales.

Intro to attribution methodology

Before we talk about the forecast model, let’s discuss what an attribution system is and how to choose one. An attribution system helps companies measure the effectiveness of marketing campaigns. It has two key elements: allocation method and attribution window.

Reviewing allocation methods

The allocation method determines how to assign credit across marketing touchpoints when a new consumer places their first order, commonly called a conversion. Allocation methods can be either single touch or multi-touch, depending on the number of channels that get credit for the conversion. See Figure 1 for an illustration.

  • Single touch assigns all conversion credit to a single channel, assuming that either the first or last touchpoint drives the conversion.
  • Multi-touch considers all touchpoints along the path to conversion. As a simple example, it might distribute credit to all touchpoints evenly.
Figure 1: The choice of allocation method affects how much credit each channel receives. In this example, the customer interacted with three marketing touchpoints before placing their first order: a DoorDash TV commercial, an ad on social media, and a paid keyword search ad.
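The two allocation families above can be sketched as a small credit-assignment function. This is an illustrative sketch, not DoorDash's production logic; the channel names and even split are assumptions for the example.

```python
from collections import defaultdict

def allocate_credit(touchpoints, method="last_touch"):
    """Distribute one conversion's credit across marketing touchpoints.

    touchpoints: channel names in chronological order,
    e.g. ["tv", "social", "search"].
    """
    credit = defaultdict(float)
    if method == "first_touch":
        # All credit goes to the first ad the consumer saw.
        credit[touchpoints[0]] += 1.0
    elif method == "last_touch":
        # All credit goes to the touchpoint just before conversion.
        credit[touchpoints[-1]] += 1.0
    elif method == "multi_touch_even":
        # Split credit evenly along the path to conversion.
        for channel in touchpoints:
            credit[channel] += 1.0 / len(touchpoints)
    else:
        raise ValueError(f"unknown method: {method}")
    return dict(credit)

path = ["tv", "social", "search"]
print(allocate_credit(path, "last_touch"))        # {'search': 1.0}
print(allocate_credit(path, "multi_touch_even"))  # each channel gets ~0.333
```

Under last touch, the search ad in Figure 1 receives all the credit; under even multi-touch, each of the three channels receives one third.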

Trade-off in attribution windows

Intuitively, the impact of marketing isn’t always immediate. After the customer in the example above saw the DoorDash commercial on TV, they might have taken days or even a few weeks before placing their first order. Thus, another key question in attribution methodology is how many days to look back when defining marketing touchpoints. This time period is the attribution window.
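The lookback rule is simple to state in code: a conversion counts toward an ad only if it happens within the window. A minimal sketch, with illustrative dates:

```python
from datetime import date

def attributed_within_window(touch_date, conversion_date, window_days):
    """Return True if the conversion falls inside the attribution window."""
    lag = (conversion_date - touch_date).days
    return 0 <= lag <= window_days

touch = date(2022, 3, 1)
print(attributed_within_window(touch, date(2022, 3, 6), 7))   # True: 5 days out
print(attributed_within_window(touch, date(2022, 3, 15), 7))  # False: 14 days out
```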

A long attribution window allows companies to recognize more conversions for any given ad. Figure 2, below, shows the difference in cost curve between two attribution windows. With a shorter attribution window (the seven-day attribution below in red), the new customers who converted after seven days aren’t credited to the ad, leading to a cost curve that underestimates the ad efficiency.

At the same time, conversion patterns are different across ads. For example, some ads, such as those on search channels, lead to a quick conversion, while ads on other channels, such as TV or radio, are slower. A shorter attribution window will cause us to underestimate the attribution for slower ads more, resulting in suboptimal marketing decisions.

Figure 2: Each dot represents one day’s marketing spend and the attributed new customers. When the attribution window is short, attributed conversion data is incomplete, resulting in an underestimating cost curve.

However, too long of an attribution window is also undesirable because it requires a longer wait time to fully measure an ad’s performance. In a rapidly changing market, longer wait times may result in out-of-date cost curves. If we could somehow include this data that isn’t available yet, the extra information would materially improve our cost curves and in turn marketing decisions, as shown in Figure 3:

Figure 3: With a long attribution window, the latest conversion data (red) is not available until many days after an ad is bought. The latest available data (black) is stale, yielding an out-of-date cost curve.

The problem with our legacy attribution system

Currently, DoorDash uses a several-day last-touch attribution system for all digital marketing channels, which provides a good balance between a holistic view of conversions for most ads and a reasonable wait time for fully refreshed attribution performance.

However, an attribution window of several days still means that ads posted in the last few days are operating off of incomplete attribution data, which can’t inform marketing decisions until the window has elapsed. Given the rapid changes in the food delivery marketing landscape, having to wait before reacting to recent data isn’t ideal. We needed a way to unlock our most recent data.

Forecasting final outcomes

Before we jump into the details, let’s discuss an ideal solution. Our ad channels come in different shapes and sizes. For example, we run a small number of high-spend TV ads and a huge number of low-spend search ads (sometimes targeting individual obscure keywords, like misspellings of “DoorDash”). Ideally, our solution should handle both small and large ads.

The approach we chose was to build a forecasting model that predicts final attribution data from a limited amount of initial data.

Defining forecast accuracy

An easy way to measure the performance of our forecast model is by backtesting. Backtesting means training the model on old data and checking whether it can predict more recent data.
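Backtesting here means a chronological walk-forward split, never a random one, so the model is always evaluated on data from after its training period. The sketch below makes that concrete; the week counts and list-of-days representation are assumptions for illustration.

```python
def backtest_splits(daily_records, train_weeks, test_weeks):
    """Yield (train, test) slices by walking forward through time.

    daily_records: per-day observations in chronological order.
    Each split trains on `train_weeks` of history and evaluates on the
    following `test_weeks`, mimicking how the model runs in production.
    """
    train_n, test_n = train_weeks * 7, test_weeks * 7
    start = 0
    while start + train_n + test_n <= len(daily_records):
        train = daily_records[start:start + train_n]
        test = daily_records[start + train_n:start + train_n + test_n]
        yield train, test
        start += test_n  # slide forward by one test period

records = list(range(35))  # 5 weeks of stand-in daily data
splits = list(backtest_splits(records, train_weeks=3, test_weeks=1))
print(len(splits))  # 2 walk-forward splits fit in 5 weeks of data
```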

The main performance metric we picked is mean absolute error (MAE):

MAE = (1/n) Σᵢ |ĉᵢ − cᵢ|

where cᵢ is the number of conversions attributed to ad i, ĉᵢ (the hat distinguishes a prediction from the actual value) is its forecast, and n is the number of ads. Because MAE simply takes the absolute value of forecasting errors, it isn't biased toward larger ads (unlike root mean square error, RMSE) or smaller ones (unlike mean absolute percentage error, MAPE).

However, one pitfall of MAE is it scales with conversion volume, which makes it harder to compare across channels or other segments, such as day of week. To facilitate comparison, we normalized MAE by conversion volume:

normalized MAE = MAE / ((1/n) Σᵢ cᵢ)
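Both metrics are a few lines of code. A minimal sketch with made-up conversion counts:

```python
def mae(actual, predicted):
    """Mean absolute error across ads."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def normalized_mae(actual, predicted):
    """MAE divided by average conversion volume, so segments of
    different sizes can be compared on one scale."""
    return mae(actual, predicted) / (sum(actual) / len(actual))

# One large ad, one medium, one small; each forecast is off by a bit.
actual = [100, 10, 1]
predicted = [110, 11, 2]
print(mae(actual, predicted))            # 4.0
print(normalized_mae(actual, predicted)) # ~0.108
```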

Building the forecast model

We wanted a forecast model that updates its predictions as we collect more data. It should be able to predict the final attribution outcome, whether there are four days of observations to work with or ten. The more of this initial data the model has, the more accurate the forecast should be.

We evaluated two types of models: 

  • Simple heuristic models 
  • Machine learning models

Simple heuristic models

The simplest models we considered assume that an ad's conversion pattern in the future will match the past N weeks. For example, suppose we want to predict the number of conversions attributed to an ad at the end of a 30-day window, c(30d). The prediction on day t (the tth day after the ad is posted) is

ĉ(30d) = c(t) × r(t), where r(t) is the historical ratio c(30d) / c(t) averaged over similar ads from the past N weeks

and c(t) is the number of attributed conversions observed so far. This approach directly applies a historical ratio to predict final conversions from the current observation.
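In code, the heuristic reduces to one multiplication against a precomputed table of historical ratios. The table values below are illustrative, not real DoorDash figures:

```python
def heuristic_forecast(c_t, t, ratio_by_day):
    """Predict final 30-day conversions from t days of observations.

    c_t: conversions attributed so far, c(t).
    ratio_by_day: hypothetical lookup of the average c(30d)/c(t)
    ratio at day t, built from fully matured ads over the past N weeks.
    """
    return c_t * ratio_by_day[t]

# Similar past ads reached ~2.5x their day-4 total by day 30, so an ad
# with 40 conversions after 4 days is forecast to finish near 100.
ratio_by_day = {4: 2.5, 10: 1.3}  # illustrative values only
print(heuristic_forecast(40, 4, ratio_by_day))   # 100.0
print(heuristic_forecast(80, 10, ratio_by_day))  # 104.0
```

As more days of data arrive, the ratio shrinks toward 1 and the forecast converges on the observed total.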

Below are some of the parameters or variations we explored with this heuristic model. We selected the best parameterized model by backtesting, as described in the previous section.

  • Number of weeks N used to calculate historical conversion ratio. This parameter corresponds to the question of how long the conversion pattern stays the same: too long (larger N) might be slow to capture market changes, while too short (smaller N) might be noisy. We considered values from one to twelve weeks.
  • Aggregation. Related to the previous point, small ads might generate too little data to confidently calculate the historical conversion ratio. Aggregating similar (e.g., same channel, creative, or region) ads when calculating the ratio can decrease noise.
  • Seasonality adjustments. Seasonality, especially day of week, plays an important role in our new customer conversions. For example, a consumer is more likely to place their first order on a weekend night than a Tuesday night. To account for that, we could calculate a different historical ratio for each day of the week.
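The aggregation and seasonality variations above can be combined: pool matured ads by posting day of week before computing the ratio, so small ads don't produce noisy per-ad estimates. A sketch under those assumptions, with invented numbers:

```python
from collections import defaultdict

def dow_ratios(history):
    """Compute a separate c(30d)/c(t) ratio per posting day of week.

    history: (day_of_week, c_t, c_30d) tuples from fully matured ads
    over the past N weeks (hypothetical training data).
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for dow, c_t, c_30d in history:
        sums[dow][0] += c_30d
        sums[dow][1] += c_t
    # Pool ads within each day of week before dividing, so small ads
    # don't dominate the ratio with noise.
    return {dow: final / early for dow, (final, early) in sums.items() if early > 0}

history = [("sat", 50, 100), ("sat", 30, 70), ("tue", 20, 60)]
ratios = dow_ratios(history)
print(ratios["sat"])  # (100 + 70) / (50 + 30) = 2.125
print(ratios["tue"])  # 60 / 20 = 3.0
```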

Machine learning models

This forecast is a typical regression problem. We tested several machine learning regression methods, including gradient-boosted decision trees (LightGBM).
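Framed as regression, each partially observed ad becomes a (features, target) pair: the features are whatever is known at prediction time, and the target is the matured final conversion count. The field names below are assumptions for illustration, not our production schema:

```python
def make_training_rows(ads):
    """Turn partially observed ads into (features, target) pairs
    that any regression model can consume.

    ads: dicts with hypothetical fields — early daily conversions,
    days observed, channel, and the matured 30-day total (the label).
    """
    rows = []
    for ad in ads:
        features = {
            "days_observed": ad["days_observed"],
            "conversions_so_far": sum(ad["daily_conversions"]),
            "channel": ad["channel"],
        }
        rows.append((features, ad["final_conversions"]))
    return rows

ads = [{"daily_conversions": [10, 8, 6], "days_observed": 3,
        "channel": "search", "final_conversions": 40}]
X_y = make_training_rows(ads)
print(X_y[0])
# ({'days_observed': 3, 'conversions_so_far': 24, 'channel': 'search'}, 40)
```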

Results

As shown in Figure 4, below, the LightGBM and simple heuristic models significantly outperform the other models.

Figure 4: Each bar represents the average normalized MAE across the five backtesting weeks. The error bar on the top shows the standard error.

However, how would this accuracy improvement translate to better marketing decisions? To better understand the true impact, we plugged the forecast model predictions back into our downstream workflow and used them to draw cost curves. Figure 5, below, shows that when these predictions are included, an example ad’s cost curve captures spend efficiency more accurately, which in turn helps us assign our marketing budget more optimally. In this case, without the forecast predictions we would underestimate the ad’s performance and mistakenly move some of its budget to other channels.

Figure 5: When only historical actuals (black dots) are used to construct a cost curve (black line), the curve poorly predicts future ad performance (blue triangles). But when forecast model output (red squares) is also included, the cost curve (red line) becomes more accurate, as in Figure 3.

As with forecast accuracy, we backtested cost curve accuracy using the same approach and the same normalized MAE metric. Currently, the cost curves are able to achieve a reasonable normalized MAE with delayed historical data, as shown in Figure 6. By plugging in the forecast predictions, the best models (simple heuristic and LightGBM) further decrease the error significantly.

Figure 6: Adding the more recent predicted attribution data from the two best forecast models (heuristic and LightGBM) significantly improves cost curve accuracy.

In conclusion, by taking advantage of recent data more quickly, the attribution forecast model significantly enhances our ability to draw cost curves and make marketing decisions. We found that the heuristic and LightGBM models performed similarly, so we chose to productionize the heuristic model because it’s simpler and more interpretable.

Summary and next steps

At this point one might ask, why does the simplest model perform best? We think there are two reasons:

  • Strong existing patterns: Consumer conversions after a marketing touchpoint usually follow a particular pattern: the majority of consumers convert in the first few days, and the number gradually declines. External factors play a relatively small role in shaping consumer behavior when the onboarding funnel is short. Therefore, a simple heuristic adequately captures the conversion flow.
  • Limited amount of data: Typically, we have only several days of observations on which to base a forecast prediction. With this small amount of data, more complex ML models don’t show many advantages. 

When patterns are less obvious, or the ads of interest span a hierarchy of different regions or countries, a simple heuristic might not perform as well and a more advanced model might be justified.

A similar methodology could be applied to attribution systems at other companies. Depending on the data available, an easy model like the simple heuristic could be a great place to start. Beyond marketing attribution, applications could also include other delayed-response situations. For example, predicting final concert ticket volume from the first few days of sales, or predicting final hotel occupancy on any given day from early bookings. Learning the final outcome sooner enables faster reactions and better decisions.