Long-tail events are often problematic for businesses because they occur somewhat frequently but are difficult to predict. We define long-tail events as large deviations from the average that nevertheless happen with some regularity. Given the severity and frequency of long-tail events, being able to predict them accurately can greatly improve the customer experience. 

At DoorDash, we encountered this long-tail prediction problem with the delivery estimated arrival times (ETAs) we show to customers. Before a customer places an order on our platform, we provide an ETA for their order estimating when it will be delivered. An example of such an estimate is shown in Figure 1, below. 

The ETA, which we use predictive models to calculate, is our best estimate of the delivery duration and serves to help set customer expectations for when their order will arrive. Delivery times can often have long-tail events because any of the numerous touchpoints of a food delivery can go wrong, making an order arrive later than expected. Unexpectedly long deliveries lead to inaccurate ETAs and negative customer experiences (especially when the order arrives much later than the ETA suggested), which reduces trust and satisfaction in our platform and drives higher churn rates. 

Figure 1: When the ETA time that customers see before making an order ends up being wrong, it hurts the customer experience and degrades trust in our platform. 

To solve this prediction problem we implemented a set of solutions to improve ETA accuracy for long-tail events (which we’ll simply call “tail events” from here on out). This was achieved primarily through improving our models in the following three ways: 

  • Incorporating real-time delivery duration signals
  • Incorporating features that effectively captured long-tail information 
  • Using a custom loss function to train the model used for predicting ETAs

Tail events and why they matter 

Before we address how we solved our problem of predicting tail events, let’s first discuss some concepts around tail events, outliers, and how they work in a broader context. Specifically we will address: 

  • The difference between outliers and tail events 
  • Why predicting tail events matters 
  • Why tail events are hard to predict 

Outliers vs. tail events 

It’s important to conceptually distinguish between outliers and tail events. Outliers tend to be extreme values that occur very infrequently. Typically they are less than 1% of the data. On the other hand, tail events are less extreme values compared to outliers but occur with greater frequency. 

Many real-life phenomena tend to exhibit a right-skew distribution with tail events characterized by relatively high values, as shown in Figure 2. For example, if you look at the daily sales of an online retailer over the course of a year, there will likely be a long-tail distribution where the tail-events represent abnormally high sales on national or commercial holidays, such as Labor Day or Black Friday. 

Figure 2: Datasets with a nontrivial proportion of high values tend to be right skewed where the average is greater than the median value. This is common when looking at things that have an uncapped upper limit.
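To make this concrete, here is a minimal sketch that uses a lognormal distribution as a stand-in for right-skewed data like daily sales or delivery durations. It shows the long tail pulling the mean above the median, and treats the top decile as the "tail" region:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a right-skewed outcome (e.g., daily sales or delivery
# durations): most values are moderate, but a long tail of high
# values pulls the mean above the median.
samples = rng.lognormal(mean=3.3, sigma=0.5, size=100_000)

mean, median = samples.mean(), np.median(samples)
print(f"mean={mean:.1f}, median={median:.1f}")  # mean exceeds median

# Tail events: large but regular deviations, e.g. the top ~10% of values.
p90 = np.percentile(samples, 90)
tail_share = (samples >= p90).mean()
print(f"90th percentile={p90:.1f}, share at or above it={tail_share:.1%}")
```

The distribution parameters here are arbitrary; the mean-above-median relationship holds for any right-skewed distribution.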

Why tail events are worth predicting

While outliers can be painful, they are so unpredictable and occur so infrequently that businesses can generally afford to dedicate the resources required to deal with any aftermath. In the online retailer example, an outlier might look like a sudden spike in demand when their product happens to be mentioned in a viral social media post. It’s typically very difficult to anticipate and prepare for these outlier events ahead of time, but manageable because they are so rare. 

On the other hand, tail events represent occurrences that happen with some amount of regularity (typically 5-10% of observations), such that they should be predictable to some degree. Even though it’s difficult to predict the exact sales volume on holidays, retailers still take on the challenge because it happens frequently enough that the opportunity is sizable.

Why are tail events hard to predict?

While tail events occur with some regularity, they still tend to be difficult to predict. The primary reason is that we often have relatively little ground truth: factual data that has been observed or measured and can be analyzed objectively. Without sufficient ground truth, it’s difficult for a model to learn generalizable patterns and accurately predict tail events. Going back to the online retailer example, there are only a handful of holidays in a year where the data indicates a tail event, so building a reliable trend to predict holiday sales is not easy (especially for newer online retailers that have only experienced, and collected data for, a few holiday seasons).

A second reason why tail events are tough to predict is that it can be difficult to obtain leading indicators that are correlated with the likelihood of a tail event occurring. Here, leading indicators refer to the features that correlate with the outcome we want to predict. An example might be individual customers or organizations placing large orders for group events or parties they’re hosting. Since retailers have relatively few leading indicators of these occurrences, it’s hard to anticipate them in advance.

DoorDash ETAs and tail events

In the DoorDash context, we are mainly concerned with predicting situations where deliveries might go over their normal ETA time. To solve this problem, we first need to delve into DoorDash’s ETAs specifically and figure out why orders might exceed their expected ETA so we can account for these issues and improve the accuracy of our model. 

The challenge of setting good ETAs 

Our ETAs predict actual delivery duration, defined as the time between when a customer places an order on DoorDash and when the food arrives. Actual delivery durations are challenging to predict because many things can happen during a delivery that can cause it to be delayed. These delays cause inaccurate ETAs, which are a big pain point for our customers. People can be extra irritable when they’re hungry!

While it may seem logical to overestimate ETAs, the prediction is actually a delicate balancing act. If the ETA is underestimated (or too short), then the delivery is more likely to be late and customers will be dissatisfied. On the other hand, if the ETA is overestimated (too long), then customers might think their food will take too long to arrive and decide not to order. Generally, our platform is less likely to suggest food options to customers that we believe will take a very long time to arrive, because overestimations reduce selection as well. Ultimately, we want to set customer expectations to the best of our ability — balancing speed vs. quality.

Tail events in the ETAs’ context 

A tail event in the ETAs’ context is a delivery that takes an unusually long time to arrive. The vast majority of deliveries arrive in less than 30 minutes, but there is high variance in delivery times because of all the unexpected things that can happen in the real world that are difficult to anticipate, for example: 

  • Merchants might be busy with in-store customers 
  • There could be a lot of unexpected traffic on the road
  • The market might be under-supplied, meaning we don’t have enough Dashers on the road to accommodate orders
  • The customer’s building address is either hard to find or difficult to enter

Factors like these lead to a right-skewed distribution for actual delivery times, as shown in Figure 3, below:

Figure 3: Most DoorDash deliveries arrive in 30 minutes or less, but the long tail of orders that stretch past 60 minutes makes our actual delivery durations right skewed by nature.

Improving ETA tail predictions 

Our solution to improving the ETA accuracy of our tail events was to take a three-pronged approach to updating our model. First, we added real-time features to our model. Then we utilized historical features that were more effective at helping the algorithm learn the sparse patterns around tail events. Lastly, we used a custom loss function to optimize for prediction accuracy when large deviations occur.

Starting with feature engineering

Identifying and incorporating features that are correlated with the occurrence and severity of tail events is often the most effective way to improve prediction accuracy. This typically requires both:

  • A deep understanding of the business domain to identify signals that are predictive of the tail events
  • A technical grasp of how to best represent those signals to help the model learn 

Initially, we established the right north star metric for prediction accuracy. In our case, we chose on-time percentage — the percentage of orders whose actual delivery time fell within a +/- margin of error of the ETA — as the key north star metric we wanted to improve. Next, the team brainstormed changes to the existing feature set and the existing loss function to push incremental improvements to the model. In the following sections, we discuss: 

  • Historical features 
  • Real-time features
  • Custom loss function
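As a rough illustration, an on-time percentage metric can be computed as follows. The function name and the five-minute margin are hypothetical choices for this sketch, not DoorDash's actual thresholds:

```python
from typing import Sequence

def on_time_percentage(actual_minutes: Sequence[float],
                       eta_minutes: Sequence[float],
                       margin: float = 5.0) -> float:
    """Share of deliveries whose actual duration landed within
    +/- `margin` minutes of the quoted ETA."""
    hits = sum(abs(actual - eta) <= margin
               for actual, eta in zip(actual_minutes, eta_minutes))
    return hits / len(actual_minutes)

# Two of these four deliveries landed within 5 minutes of their ETA.
print(on_time_percentage([28, 41, 33, 62], [30, 35, 33, 45]))  # 0.5
```

A symmetric margin like this penalizes both early and late deliveries, which matches the balancing act described above: both underestimated and overestimated ETAs hurt the experience.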

Historical features

We found that bucketing and target-encoding of continuous features was an effective way to more accurately predict tail events. For example, let’s say we have a continuous feature like a marketplace health metric (on a 0-100 scale) that captures our supply and demand balance in a market at a given point in time. Very low values of this feature (e.g., below 10) indicate extreme supply-constrained states that lead to longer delivery times, but this doesn’t happen frequently. 

Instead of directly using marketplace health as a continuous feature, we decided to use a form of target-encoding by splitting up the metric into buckets and taking the average historical delivery duration within that bucket as the new feature. With this approach, we directly helped the model learn that very supply-constrained market conditions are correlated with very high delivery times — rather than relying on the model to learn those patterns from the relatively sparse data available.  
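A minimal sketch of this bucketing-plus-target-encoding step, with made-up column names, bucket boundaries, and toy data:

```python
import pandas as pd

# Toy historical data: marketplace health (0-100) and observed
# delivery durations in minutes. Names and values are illustrative.
df = pd.DataFrame({
    "marketplace_health": [5, 8, 22, 35, 47, 55, 63, 71, 88, 95],
    "delivery_minutes":   [68, 61, 45, 40, 36, 33, 31, 30, 27, 26],
})

# Bucket the continuous metric, then target-encode: replace each row's
# bucket with the average historical delivery duration in that bucket.
buckets = pd.cut(df["marketplace_health"], bins=[0, 10, 25, 50, 75, 100])
df["health_encoded"] = (
    df.groupby(buckets, observed=True)["delivery_minutes"].transform("mean")
)
print(df[["marketplace_health", "health_encoded"]])
```

In a real pipeline the encoding would be computed on historical data and joined in at serving time, rather than derived from the same rows being predicted, to avoid target leakage.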

Real-time features

We also built real-time features to capture key leading indicators on the fly. Generally speaking, it’s extremely challenging to anticipate all of the events, such as holidays, local events, and weather irregularities, that can impact delivery times. Fortunately, by using real-time features we can avoid the need to explicitly capture all of that information in our models. 

Instead, we monitor real-time signals which implicitly capture the impact of those events on the outcome variable we care about — in this case, delivery times. For example, we look at average delivery durations over the past 20 minutes at a store level and sub-region level. If anything, from an unexpected rainstorm to road construction, causes elevated delivery times, our ETAs model will be able to detect it through these real-time features and update accordingly.
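A simplified, in-memory sketch of such a rolling real-time feature (a production system would use a stream processor with per-store and per-region keys; the class and window here are illustrative):

```python
from collections import deque

class RollingAverage:
    """Average delivery duration over completions in the last
    `window_s` seconds -- a stand-in for one real-time feature."""

    def __init__(self, window_s: float = 20 * 60):
        self.window_s = window_s
        self.events = deque()  # (timestamp_s, duration_min) pairs

    def record(self, ts: float, duration_min: float) -> None:
        self.events.append((ts, duration_min))

    def value(self, now: float):
        # Evict completions that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        if not self.events:
            return None  # no signal; caller falls back to historical features
        return sum(d for _, d in self.events) / len(self.events)

feat = RollingAverage()
feat.record(ts=0, duration_min=30)
feat.record(ts=600, duration_min=50)    # durations spike (e.g., rainstorm)
print(feat.value(now=900))   # 40.0 -- both deliveries still in window
print(feat.value(now=1500))  # 50.0 -- the older delivery aged out
```

Because the feature reacts to observed durations rather than to named causes, the model picks up the effect of a rainstorm or road closure without ever seeing a "rainstorm" feature.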

Using a quadratic loss function 

A good practice when trying to detect tail events is to utilize a quadratic or L2 loss function. Mean Squared Error (MSE) is perhaps the most commonly used example. Because the loss function is calculated based on the squared errors, it is more sensitive to the larger deviations associated with tail events, as shown in Figure 4, below:

Figure 4: Quadratic loss functions are preferable to linear loss functions since they are more sensitive and the amount of relevant data is sparser when predicting unlikely events.

Initially, our ETA model was using a quantile loss function, which is a linear function. While this had the benefit of allowing us to predict a certain percentile of delivery duration, it was not effective at predicting tail events. We decided to switch from quantile loss to a custom asymmetric MSE loss function, which better accounts for large deviations in errors.

Quantile loss function:

$$L_\gamma(y, \hat{y}) = \max\big(\gamma\,(y - \hat{y}),\; (\gamma - 1)\,(y - \hat{y})\big)$$

with γ∈(0,1) as the required quantile

Asymmetric MSE loss function:

$$L_\alpha(y, \hat{y}) = \begin{cases} \alpha\,(y - \hat{y})^2 & \text{if } y > \hat{y} \text{ (late delivery)} \\ (1 - \alpha)\,(y - \hat{y})^2 & \text{otherwise} \end{cases}$$

with α∈(0,1) being the parameter we can adjust to change the degree of asymmetry

In addition, asymmetric MSE loss more accurately and intuitively represented the business trade-offs we were facing. For example, this approach requires us to explicitly state that a late delivery is X times worse than an early delivery, where X corresponds to the ratio of the late and early weights, α/(1−α).
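As a sketch, an asymmetric MSE objective can be written in the gradient/hessian form that gradient-boosting libraries such as LightGBM and XGBoost accept for custom objectives (exact argument order varies by API; this is an illustration, not DoorDash's exact implementation):

```python
import numpy as np

def asymmetric_mse(alpha: float):
    """Build a custom objective returning (gradient, hessian) per sample.
    Errors where the actual duration exceeds the prediction (a late
    delivery) are weighted by alpha; early deliveries by 1 - alpha."""
    def objective(y_pred: np.ndarray, y_true: np.ndarray):
        residual = y_pred - y_true
        # residual < 0 means the prediction was too short: a late delivery.
        weight = np.where(residual < 0, alpha, 1.0 - alpha)
        grad = 2.0 * weight * residual   # d/dy_pred of weight * residual^2
        hess = 2.0 * weight
        return grad, hess
    return objective

# With alpha = 0.8, a late error is weighted 0.8/0.2 = 4x an early one.
grad, hess = asymmetric_mse(0.8)(np.array([30.0, 30.0]),
                                 np.array([45.0, 20.0]))
print(grad)  # [-24.   4.]
```

The first order here is 15 minutes late and the second 10 minutes early; despite the smaller absolute error, the late order's gradient is much larger, so the model is pushed harder to avoid underestimating.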

Results 

After observing accuracy improvements in offline evaluation, we shadowed the model in production for two weeks to verify that the improvements transferred to our online predictions. Then, we ran a randomized experiment to measure whether our model improved targeted ETA metrics as well as key consumer behaviors. 

Based on the experiment results, we were able to improve long-tail ETA accuracy by 10% (while maintaining constant average quotes). This led to significant improvements in the customer experience by reducing the frequency of very late orders, particularly during critical peak meal times when markets were supply-constrained.

Conclusion

We have found the following principles to be especially useful in modeling tail events:

First, investments in feature engineering tend to have the biggest returns. Focus on incorporating features that capture long-tail signals as well as real-time information. By definition, there typically isn’t much data on tail events, so it’s important to be thoughtful about representing features in a way that makes it easy for the model to learn sparse patterns.

Second, it’s helpful to craft a loss function that closely represents the business tradeoffs. This is a good practice in general to maximize the business impact of ML models. When dealing with tail events specifically, these events are so damaging to the business that it’s even more important to ensure the model accurately accounts for these tradeoffs.

If you are passionate about building ML applications that impact the lives of millions of merchants, Dashers, and customers in a positive way, consider joining our team.

Acknowledgements

Thanks to Alok Gupta, Jared Bauman, and Spencer Creasey for helping us write this article.