An optimized merchant selection strategy has been one of the key factors that has enabled DoorDash to become an industry leader in US food delivery service. DoorDash was founded later than many companies in this industry, but we have effectively onboarded high-value merchants to ensure the selection in every market matches customer demand. We were able to do this quickly by developing machine learning models that identify which merchant characteristics do well on the platform while ensuring every market’s selection addresses customer food preferences and popular local merchants. These models provide intelligence to our merchant sales teams so that they can effectively find and onboard merchants that will delight customers and expand the total addressable market (TAM) for DoorDash by adding increased selection to the platform.
What a model needs to inform merchant selection
To identify the potential market, we need to know what value off-platform merchants can bring to DoorDash consumers. To help our sales team recruit merchants that will provide a better selection, we need to rank the off-platform merchants accurately to achieve high efficiency with limited resources. Specifically, we need models that can discern numerical answers to a number of abstract questions:
- The value of a restaurant: What characteristics make a restaurant a valuable addition to our platform?
- How can we evaluate the addressable market in any given region?
- How can we determine what selection is missing and onboard merchants that offer those types of food for our consumers?
To address these areas of market intelligence, our next step is to figure out what kind of machine learning models we need to build our market intelligence system. Ideally, these models will understand what makes a merchant successful and what consumers want on the platform.
Specifically, we need a system that:
- Allows us to compare merchants fairly at different time periods (based on their time using the platform): To measure merchant performance fairly, we need to collect success metrics in a given timeframe and label them properly so that we can train the model on what successful merchants look like.
- Builds features for model prediction: Once we have our labels, we need to develop the infrastructure to build features that will use the data to predict the above identified success metrics. Because our business is constantly expanding, this feature engineering needs to be scalable, allow for fast iterations, and work in global and local marketplaces.
- Validates model performance: To choose the best model that will be most predictive with our features, we need to set up criteria for the success metrics that will be most valuable to us. This may involve evaluating different tradeoffs to determine which is best for our use cases.
Selection system design
To kick things off, we start with the overall architecture. As seen in Figure 1, we collect merchant data from various sources and create relevant features on a daily basis in our feature store.
These features capture various traits, such as business type and whether a merchant is local or part of a chain. The features are then used to generate the labels and training datasets. After data preprocessing, we train a set of XGBoost models, one per merchant segment, to predict how successful merchants will be when they join the platform. Model performance metrics are fed into multiple monitoring dashboards to make sure the models don’t degrade over time. We also generate daily predictions for all off-platform merchants in our system based on their most up-to-date information. Databricks packages are used for model management and daily prediction. In addition to the model performance dashboard, we also provide dashboards for model interpretation and ad hoc debugging.
Building the feature store
For each merchant, we built features to cover a variety of aspects, including:
- Consumer intent: Consumer demand is a big component in this process. We learn what consumers want from their behavior on the platform and encode it as consumer intent features. However, these types of features are typically sparse, which requires us to pick ML models that handle sparse features well.
- Merchant information: To capture the differentiated quality of our potential merchants, we generate a collection of features such as geolocation, cuisine type, reviews, and business hours.
Training ML models to understand merchant values
We begin by selecting model labels. To understand the value of off-platform merchants, we use data from existing merchants to discern what makes a merchant successful and valuable to the platform. In label selection, business input and OKRs are among the most important factors that should feed into the ML process.
The second important consideration is how we define the true success of merchants. For example, most merchants need time to ramp up their DoorDash operation, so new merchants don’t appear immediately successful while they complete onboarding tasks. We need labels calculated over the period when new merchants are most likely to show their true potential.
As seen in Figure 2, when generating the labels for training datasets we use sales from the stable period after activation. We ensure the aggregation window is consistent for all merchants to make them comparable. Some special considerations have to be made for merchants who activated and deactivated multiple times.
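The labeling rule can be sketched as follows. This is a minimal illustration, not our production logic: the 30-day ramp-up period, the 90-day aggregation window, the choice to measure from a merchant's most recent activation, and the function and parameter names are all assumptions made for the example.

```python
from datetime import date, timedelta

RAMP_UP_DAYS = 30  # assumed ramp-up period skipped after activation
WINDOW_DAYS = 90   # assumed aggregation window, identical for every merchant


def success_label(activations, daily_sales, as_of):
    """Sum sales over a fixed window that starts once the ramp-up period ends.

    activations: list of activation dates (a merchant may have several)
    daily_sales: dict mapping date -> sales amount
    as_of: date the training set is built; the window must be complete by then
    Returns None when the merchant has no complete window yet.
    """
    if not activations:
        return None
    # Assumption: for merchants with several activations, use the latest one.
    start = max(activations) + timedelta(days=RAMP_UP_DAYS)
    end = start + timedelta(days=WINDOW_DAYS)
    if end > as_of:
        return None  # incomplete window, so the label would not be comparable
    return sum(amount for day, amount in daily_sales.items() if start <= day < end)


# Example: a merchant selling 10.0 per day for 200 days after activation.
sales = {date(2021, 1, 1) + timedelta(days=i): 10.0 for i in range(200)}
label = success_label([date(2021, 1, 1)], sales, as_of=date(2021, 7, 1))
```

Returning `None` for incomplete windows keeps recently activated merchants out of the training set instead of giving them artificially low labels.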
Ensuring model performance while building up the merchant base
As we added more selection to the platform, we suffered from a small sample size in the early days, which was one of the top reasons for poor model performance. There are many ways to augment training data in machine learning. In our case, we chose to include existing merchants who have been on DoorDash for a longer time in the training data to increase the sample size.
Model training and selection based on different business sizes and DoorDash interaction histories
During the prior two steps, we generated the training dataset with appropriate labels and related features. At this point we need to pick the right model and corresponding loss function. Because the label we chose to predict is a continuous variable, we compared several standard regression models and tree-based regression models, including linear regression models and LightGBM models.
We look not only at aggregate model performance when choosing the best model, but also at performance within each segment, so that sales can operate at the lowest level of detail. Some example segments include:
- Are they new to DoorDash?
- Are they chain-store merchants?
- Are they local merchants?
We found that distinct merchant segments showed significant differences in the relationship between the success label and the feature sets, which led us to build a separate model for each segment.
Finally, we picked XGBoost models with tuned hyperparameters. This concluded our model training steps, and we now move on to inference.
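The idea of training one model per segment and routing each merchant to its segment's model can be sketched as below. To keep the example dependency-free, a one-feature least-squares line stands in for the tuned XGBoost regressor; the segment names, the single feature, and all function names are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for a tree model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return mean_y - slope * mean_x, slope


def train_segment_models(rows):
    """rows: iterable of (segment, feature, label) -> {segment: (intercept, slope)}."""
    grouped = defaultdict(lambda: ([], []))
    for segment, x, y in rows:
        grouped[segment][0].append(x)
        grouped[segment][1].append(y)
    return {segment: fit_line(xs, ys) for segment, (xs, ys) in grouped.items()}


def predict(models, segment, x):
    """Score a merchant with the model trained on its own segment."""
    intercept, slope = models[segment]
    return intercept + slope * x


# Two segments whose label depends on the feature in different ways.
rows = [
    ("chain", 1, 3), ("chain", 2, 5), ("chain", 3, 7),
    ("local", 1, 5), ("local", 2, 10), ("local", 3, 15),
]
models = train_segment_models(rows)
```

A single pooled model would have to average the two relationships, whereas the per-segment models recover each one separately.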
Generating valid daily predictions going forward
After training the models, we generate daily predictions for each off-platform merchant. Our business partners rely on these predicted values to understand the latest market trends. For example, some may wish to know the market potential for a new service we want to provide and whether they should be prioritized in the future. To ensure such market trends are reflected in the models, we typically retrain them every week or month using our internal infrastructure based on Databricks MLflow. At prediction time, registered models are pulled automatically from a model store to perform predictions with features processed in Snowflake. Standard ML metrics are logged during training, including R-squared and ranking metrics.
Business operations rely not only on the prediction itself but often also on the reasoning behind a specific estimate. Therefore, we focused on model interpretability using Shapley values. Besides overall feature importance, one advantage of Shapley values is that they show the direction of each feature’s impact. Business teams often use these to understand the market-level situation. This view is also helpful for monitoring the models over time: if there are big changes in important features, we dig into the training data to make sure there are no errors in the model training process.
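To make the point about direction concrete, here is a small sketch that computes exact Shapley values for a single prediction by averaging each feature's marginal contribution over all feature orderings. The scoring function, feature names, and baseline are invented for illustration; brute-force enumeration only works for a handful of features, and in practice a library such as SHAP would be used with tree models.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values for one prediction.

    Features that have not yet been "revealed" in an ordering keep their
    baseline value; each feature's value is its average marginal contribution.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(f):
    # Hypothetical scoring function, not the production model.
    return 100 * f["rating"] - 20 * f["distance_km"] + 5 * f["rating"] * f["open_hours"]


baseline = {"rating": 3, "distance_km": 1, "open_hours": 8}
merchant = {"rating": 4, "distance_km": 3, "open_hours": 10}
phi = shapley_values(toy_model, merchant, baseline)
```

In this toy case `distance_km` receives a negative value while `rating` and `open_hours` receive positive ones, and the three values sum to the gap between the merchant's score and the baseline score.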
ML system outputs and their downstream application
The ML models generate two predictions for each off-platform merchant: its rank in the local market and its success value. These are the two most important metrics the operations team uses to prioritize leads for sales and to calculate DoorDash’s addressable percentage for each market. For example, Figure 4 shows how we rank leads in one of our markets. To calculate the addressable market correctly, this process sorts the values of all local merchants, which may include both off-platform and on-platform merchants.
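A rough sketch of this ranking step is shown below. The record fields and the function name are hypothetical, and the share computed here (the fraction of a market's total merchant value that is already on the platform) is only one assumed way to express an addressable percentage, not necessarily the definition our operations team uses.

```python
def rank_market(merchants):
    """Rank all local merchants by value and summarize the market.

    merchants: list of dicts with 'name', 'value', and 'on_platform' keys.
    Returns (leads, on_platform_share): the off-platform merchants with their
    rank among all local merchants, and the assumed share of value onboarded.
    """
    ordered = sorted(merchants, key=lambda m: m["value"], reverse=True)
    total = sum(m["value"] for m in ordered)
    leads = [
        {"name": m["name"], "rank": position, "value": m["value"]}
        for position, m in enumerate(ordered, start=1)
        if not m["on_platform"]
    ]
    onboarded = sum(m["value"] for m in ordered if m["on_platform"])
    return leads, (onboarded / total if total else 0.0)


market = [
    {"name": "A", "value": 50, "on_platform": True},
    {"name": "B", "value": 30, "on_platform": False},
    {"name": "C", "value": 20, "on_platform": True},
    {"name": "D", "value": 100, "on_platform": False},
]
leads, share = rank_market(market)
```

Ranking against every local merchant, rather than only against other leads, is what lets a lead's rank reflect its position in the whole market.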
Evaluating the business value of models
After we have a set of predictions, we need to set up model validation. This is particularly challenging in our case because sales in many businesses usually have long closing cycles. To translate offline model performance into true business impact, we created two metrics, weighted decile CPE and decile rank score, to measure the quality of the predictions and rankings. Both are tailored to our actual sales lead allocations.
Weighted decile CPE is used to track the performance of model predictions. We calculate the percentage error by comparing the predicted sales and the actual sales for merchants with the same decile rank. Business inputs were collected to create additional weights during the calculation.
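One way such a metric could be computed is sketched below. The exact formula is not spelled out here, so the details are assumptions: merchants are bucketed into deciles by predicted sales, each decile's percentage error is the absolute gap between summed predicted and summed actual sales divided by the actual sum, and the per-decile weights are hypothetical stand-ins for the business inputs.

```python
def assign_deciles(values):
    """Return a decile (1 = highest values, 10 = lowest) for each entry."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    deciles = [0] * len(values)
    for position, index in enumerate(order):
        deciles[index] = position * 10 // len(values) + 1
    return deciles


def weighted_decile_cpe(predicted, actual, weights):
    """Weighted average of per-decile percentage errors.

    weights: dict mapping decile -> business weight (defaults to 1.0).
    """
    deciles = assign_deciles(predicted)
    pred_sum, actual_sum = {}, {}
    for decile, p, a in zip(deciles, predicted, actual):
        pred_sum[decile] = pred_sum.get(decile, 0.0) + p
        actual_sum[decile] = actual_sum.get(decile, 0.0) + a
    weighted_error = total_weight = 0.0
    for decile, a in actual_sum.items():
        if a == 0:
            continue  # percentage error is undefined without actual sales
        w = weights.get(decile, 1.0)
        weighted_error += w * abs(pred_sum[decile] - a) / a
        total_weight += w
    return weighted_error / total_weight if total_weight else 0.0


predicted = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
actual = [80, 90, 80, 70, 60, 50, 40, 30, 20, 10]
# Hypothetical weighting: errors in the top decile count three times as much.
score = weighted_decile_cpe(predicted, actual, weights={1: 3.0})
```

Weighting the top deciles more heavily reflects that errors on the highest-value leads are the most costly for planning.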
Decile rank score measures whether the predictions rank merchants correctly, which depends less on the accuracy of the predicted values. It uses a balance score table to score the difference between predicted ranks and actual ranks: the greater the difference, the higher the balance points. The table in Figure 5 shows the balance points associated with each actual vs predicted decile rank combination. We calculate the weighted average score based on the merchant count in each cell.
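The weighted average over the balance score table can be sketched as follows. The real balance points come from the table in the figure; here we assume, purely for illustration, that the points equal the absolute gap between the actual and predicted decile, so a lower score means better ranking. The function names are hypothetical.

```python
from collections import Counter


def default_points(actual_decile, predicted_decile):
    """Assumed balance points: grow with the gap between the two deciles."""
    return abs(actual_decile - predicted_decile)


def decile_rank_score(predicted_deciles, actual_deciles, balance_points=default_points):
    """Average balance points, weighted by the merchant count in each cell."""
    cells = Counter(zip(actual_deciles, predicted_deciles))
    total = sum(cells.values())
    if total == 0:
        return 0.0
    weighted = sum(balance_points(a, p) * count for (a, p), count in cells.items())
    return weighted / total


# Four merchants: two ranked in the right decile, one off by 1, one off by 2.
rank_score = decile_rank_score(predicted_deciles=[1, 1, 2, 3], actual_deciles=[1, 2, 2, 5])
```

Because only deciles enter the calculation, a model can score well here even when its predicted sales values are systematically too high or too low.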
These metrics measure model performance in two ways. The weighted decile CPE shows how accurate our predictions are, indicating how well we understand our potential market, which is important for planning and for setting appropriate quarterly growth goals. Decile rank score measures how accurately we rank our off-platform merchants. As we prioritize which merchants to acquire each quarter, we want to pursue those who bring the most value to DoorDash customers.
Other components offered by merchant selection ML platform
Merchant selection is a complex process with multiple steps, and the model described above supports one of the most important ones. Our merchant selection ML platform also offers models that answer other business questions, such as:
- How can we ensure an off-platform merchant is an open restaurant?
- Can the merchant fill the demand gap for certain items in specific geographical locations?
- What’s the probability that the merchant will come to DoorDash?
These are all important questions to consider when allocating our sales resources.
Many business-oriented applications need to forecast the value of potential merchants in order to allocate resources more appropriately, whether for staffing, personalized promotion packages, or potential product sales. We have seen great value from investing in such intelligence to improve operational efficiency.