Migrating functionality from a legacy system to a new service is a fairly common endeavor, but moving machine learning (ML) models is much more challenging. Successful migrations ensure that the new service works as well as or better than its legacy version. But the complexity of ML models makes regressions more likely to happen and harder to prevent, detect, and correct during migrations. To avoid migration regressions, it’s essential to keep track of key performance metrics and ensure these do not slip.
At DoorDash, everything that we do is deeply data-driven, meaning that ML models are embedded in many of the services that we migrated out of our legacy monolith. We developed best practices to keep track of key performance metrics and ensure minimal regressions throughout the migrations.
What can go wrong when migrating to a new service
Migrating business logic to a new service often involves a new tech stack that can alter all the interactions within services, compared to how they operated before. The most common issue to arise is that the data describing the state of the world, such as time, history, and conditions, becomes reflected differently in a new service. For example, if the new service makes asynchronous calls while the legacy service does synchronous ones, the new service gets a snapshot of more recent states. Or if the new service switches from a data store with strong consistency to one with eventual consistency, the new service does not necessarily receive a snapshot of the newest states.
Based on how the service was migrated, the different state snapshots will result in different inputs and lead to different outputs. Consider a task that sends push notifications to nudge customers to place an order when they open the DoorDash app. The notification message is determined by the time it is sent: if it is sent before 2 p.m., it will say “Your lunch is here.”, otherwise it will say “Your dinner is here.” If the task is migrated to an asynchronous worker from a synchronous API call, the asynchronous worker will send push notifications a bit later. After the migration, even though a customer opens the app at 1:59 p.m., they could receive a message reading “Your dinner is here.”
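The effect is easy to reproduce in a few lines. The sketch below is only an illustration of the example above: the function name, the date, and the 45-second queue delay are hypothetical, and the only detail taken from the text is the 2 p.m. cutoff.

```python
from datetime import datetime, timedelta

LUNCH_CUTOFF_HOUR = 14  # 2 p.m., from the example above


def notification_message(sent_at: datetime) -> str:
    """Pick the message from the time the notification is actually sent."""
    if sent_at.hour < LUNCH_CUTOFF_HOUR:
        return "Your lunch is here."
    return "Your dinner is here."


# Arbitrary date; only the time of day matters here.
app_opened_at = datetime(2021, 1, 4, 13, 59, 30)

# Legacy synchronous path: the message is built when the app is opened.
sync_message = notification_message(app_opened_at)

# Migrated asynchronous path: a worker picks the task up a bit later.
# The 45-second delay is a hypothetical value for illustration.
queue_delay = timedelta(seconds=45)
async_message = notification_message(app_opened_at + queue_delay)

print(sync_message)
print(async_message)
```

The business logic is identical in both paths; only the moment at which the state of the world is read has changed.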
When there are different outputs, like in the example above, it is possible that a migrated function on the new service performs worse than the legacy logic it replaced. In the notification example, it is possible that customers are discouraged by the different wording and less likely to place an order. In these cases the regression would be detrimental to the business.
Why migrations involving ML models carry larger regression risks
While the regression in the example above may seem minor and relatively harmless, migrations with ML models often suffer more frequent and serious regressions. ML models tend to have dozens if not hundreds of features. A larger quantity of features generally means more differences compared to the original performance. If a single factor like the send time causes 1% of differences alone, ten more similar features can contribute 10% more differences. Making matters worse, the interactions among affected features often further amplify these differences. As a rule of thumb, the more differences there are, the more likely a regression will happen and the worse the regression will be when it does happen.
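As a rough back-of-the-envelope check, we can estimate how quickly small per-feature differences add up. The sketch below assumes, purely for illustration, that each feature differs independently with the same probability. Real features interact, as noted above, so this should be read as intuition rather than as a model of any actual system.

```python
def share_with_any_difference(per_feature_rate: float, n_features: int) -> float:
    """Share of requests where at least one feature differs.

    Assumes every feature differs independently with the same probability,
    which is a simplification made only for this illustration.
    """
    return 1.0 - (1.0 - per_feature_rate) ** n_features


for n_features in (1, 11, 100):
    share = share_with_any_difference(0.01, n_features)
    print(f"{n_features:>3} features -> {share:.1%} of requests differ")
```

Under these assumptions, one feature with a 1% difference rate grows to roughly 10% with eleven such features, and to well over half of all requests with a hundred.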
Many underlying algorithms of ML models are so complicated that even data scientists, who design the ML models and have the list of all features, do not fully understand how the features work together to create a final output. This ML model complexity has three negative consequences:
- We cannot estimate the size and direction of the differences in the performance during the design phase, therefore we cannot prepare preventive measures for any potential regression before we migrate the business logic to a new service.
- Regressions can corrupt model performance insidiously, meaning it can take a long time to detect the regression.
- It is very hard to figure out and correct the root cause of a different result even after the fact.
Good practices to prevent regressions when migrations involve ML models
Ultimately, measuring business objectives and ensuring they are the same after a migration is how you determine if the overall effort was successful. Because migrations involving ML models are more likely to experience a regression, and that regression is harder to detect and correct retrospectively, we must think beyond the act of migration itself and consider the business’ end goals. We should keep tracking performance against those end goals and make sure that it does not dip throughout the migration. We therefore recommend these best practices to guard against regressions:
- Specify business metrics and acceptable thresholds.
- Identify and isolate risky components of migrations.
Specify the business metrics and acceptable thresholds in testable forms
Once we define the migration’s business end goals, we need to specify our success metrics and how they are measured and evaluated. In particular, each metric should have an acceptable threshold that can be stated as a testable hypothesis and can be supported or refuted by a scientific or otherwise sound business methodology. Measuring the migrated application’s performance against testable metric thresholds can help us detect any regressions.
Let’s use the push notification example from before to illustrate what a testable threshold looks like. The stated business objective is to encourage customers to place more orders. One possible metric is the average number of orders a customer places within 14 days of receiving the notification. In that case, the acceptable metric threshold can be stated as “with an A/B test running for 14 days whose control group receives push notifications from the old service and the treatment group from the new one, there is no statistically significant difference, at a 0.05 significance level, between the average number of orders placed by the two groups”.
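A threshold stated this way maps directly onto a two-sample test. The following is a minimal sketch rather than the analysis we actually ran: it uses a two-sided z-test on the difference in means (a normal approximation that is only reasonable for large samples), and the function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence


def two_sided_p_value(control: Sequence[float], treatment: Sequence[float]) -> float:
    """z-test on the difference in means, using unpooled sample variances."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = (mean(treatment) - mean(control)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def passes_threshold(
    control: Sequence[float], treatment: Sequence[float], alpha: float = 0.05
) -> bool:
    """True when no statistically significant difference is detected."""
    return two_sided_p_value(control, treatment) >= alpha


# Hypothetical orders per customer over 14 days.
old_service = [1, 2, 3, 4] * 50
new_service_same = [1, 2, 3, 4] * 50
new_service_shifted = [2, 3, 4, 5] * 50

print(passes_threshold(old_service, new_service_same))
print(passes_threshold(old_service, new_service_shifted))
```

Note that failing to detect a difference is not proof of equivalence, so the sample size and test duration still need to be chosen with care.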
Oftentimes, business insights can support simpler but equally valid testable thresholds. For example, the acceptable threshold can be that the average number of orders placed in the 30 days after the migration is more than 99.5% of the historical yearly average. So long as we can set up the business metrics in this way, we can scientifically determine if a migration causes a regression.
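This simpler kind of threshold reduces to a single comparison. The sketch below encodes the 99.5% example; the function name and the sample numbers are hypothetical.

```python
def meets_historical_threshold(
    post_migration_avg: float,
    historical_yearly_avg: float,
    min_ratio: float = 0.995,
) -> bool:
    """True when the post-migration average exceeds min_ratio of the baseline."""
    return post_migration_avg > min_ratio * historical_yearly_avg


# Hypothetical average orders per customer.
print(meets_historical_threshold(post_migration_avg=2.00, historical_yearly_avg=2.00))
print(meets_historical_threshold(post_migration_avg=1.98, historical_yearly_avg=2.00))
```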
Identify and isolate risky components of migrations
After determining the business metrics and acceptable thresholds, we next identify which factors could degrade those metrics in order to figure out which components are risky to migrate. Once we identify the riskiest components, we isolate and migrate them sequentially. After each component is migrated, we should, if possible, validate its impact on the metrics against the thresholds. The idea is to monitor performance throughout the whole migration and detect any regression at the earliest possible moment. This method reduces the scope of the analysis when metric degradations arise.
It is a best practice to split the risky components into the smallest modules possible. The smaller the modules are, the easier it is to analyze them and ensure they are within their success metric’s thresholds. Consider a migration that moves multiple ML models from one service to the other. Because each model can degrade the business performance in its own way, it is better to think of each of them as an independent risk factor instead of thinking of all ML models as one risk factor.
Migrating the risky components sequentially instead of in parallel will save time rather than waste it. In the above example where we need to move multiple models, migrating one of them at a time enables us to measure the success metrics with more isolation. Furthermore, what we learn from the earlier models can help us deal with future ones faster.
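The sequential approach can be sketched as a simple loop that stops at the first component that misses its thresholds. Everything here is hypothetical scaffolding: in practice `migrate` would be a rollout and `passes_thresholds` would be an experiment that takes days, not a function call.

```python
from typing import Callable, Iterable, Optional


def migrate_sequentially(
    components: Iterable[str],
    migrate: Callable[[str], None],
    passes_thresholds: Callable[[str], bool],
) -> Optional[str]:
    """Migrate one component at a time and validate each before moving on.

    Returns the first component that fails validation, or None if all pass.
    """
    for component in components:
        migrate(component)
        if not passes_thresholds(component):
            return component
    return None


# Hypothetical run in which the second model misses its thresholds.
migrated = []
failed = migrate_sequentially(
    ["model_a", "model_b", "model_c"],
    migrate=migrated.append,
    passes_thresholds=lambda name: name != "model_b",
)
print(failed, migrated)
```

Because the loop stops at the failing component, the root-cause analysis only has to consider that one component rather than everything migrated so far.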
Applying the best practices to a DoorDash migration
As a last-mile logistics company, DoorDash relies heavily on time estimates inferred by ML models in real time. One group of these time estimates is used to find and assign the most suitable Dasher, our term for a delivery driver, for each delivery. This functionality needed to be migrated to a new service from our old monolith service as part of our migration from a monolithic codebase to a microservices architecture.
The migration re-engineers the data flow used in the assignment decisions. Previously, the monolith client was the center of all data exchange. In particular, the monolith client requested the inferences and passed them on to the assignment client, which made the assignment decisions. After the migration, the assignment client itself requests the inferences it needs. We also used a new-generation server to evaluate the ML models, as shown in Figure 1.
Defining our business objective and metrics
Our business objective is to assign the most suitable Dashers to the deliveries, so we decided the goal of the migration is to maintain the quality of our assignment efficiency. There are a few metrics we use internally to measure the quality of the assignment efficiency, among which two are particularly important. One, called ASAP, measures how long it takes consumers to get their orders, and the other, called DAT, focuses on how long a Dasher takes to deliver the order.
The new service has a switchback experiment framework which measures and compares the efficiency of every assignment component change. This is the protocol and infrastructure that we decided to piggyback on to measure the two migration success metrics. Specifically, we would run a 14-day switchback experiment where control groups use the output from the old service and treatment groups use the output from the new service. The success threshold is that there is no statistically significant difference (at a 0.05 significance level) between the results from the two groups in terms of the metrics ASAP and DAT.
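Switchback experiments typically randomize over time windows (often per region) rather than over individual customers, so that everything inside a window sees the same variant. The sketch below shows one common way to do that with a deterministic hash. It is not our framework's implementation: the window length, the salt, and the function name are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone

WINDOW_SECONDS = 30 * 60  # hypothetical window length


def switchback_group(region_id: str, at: datetime, salt: str = "migration-exp") -> str:
    """Assign a (region, time window) unit to control or treatment.

    Every request in the same region and window gets the same group.
    """
    window_index = int(at.timestamp()) // WINDOW_SECONDS
    key = f"{salt}:{region_id}:{window_index}".encode()
    first_byte = hashlib.sha256(key).digest()[0]
    return "treatment" if first_byte % 2 else "control"


noon = datetime(2021, 1, 4, 12, 0, tzinfo=timezone.utc)
ten_past = datetime(2021, 1, 4, 12, 10, tzinfo=timezone.utc)

# Both timestamps fall in the same 30-minute window, so the groups match.
print(switchback_group("region-1", noon) == switchback_group("region-1", ten_past))
```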
The nuance here is the time constraint. We cannot spend an indefinite amount of time root causing and correcting the migration if the success thresholds are not achieved. After discussing with the business and technical stakeholders, we agreed to a secondary success threshold in case there was a statistically significant difference in either of the two metrics. The secondary success threshold specifies the maximum acceptable degradation of either ASAP or DAT. Although we didn’t end up using this criterion, having this kind of criterion upfront helped manage time and resources.
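The two-tier criterion can be written down as a small decision function. This is a sketch under assumptions: the names are hypothetical, degradation is expressed as a relative change where a positive value means worse, and the numeric limits below are made up rather than the ones we agreed on.

```python
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    PASS_PRIMARY = "no statistically significant difference"
    PASS_SECONDARY = "significant, but within the agreed degradation limit"
    FAIL = "significant and beyond the agreed degradation limit"


def evaluate_metric(
    p_value: float, degradation: float, max_degradation: float, alpha: float = 0.05
) -> Verdict:
    """Apply the primary threshold first, then fall back to the secondary one."""
    if p_value >= alpha:
        return Verdict.PASS_PRIMARY
    if degradation <= max_degradation:
        return Verdict.PASS_SECONDARY
    return Verdict.FAIL


def step_passes(verdicts: Iterable[Verdict]) -> bool:
    """A migration step passes only if no metric fails."""
    return all(verdict is not Verdict.FAIL for verdict in verdicts)


# Hypothetical results for the two metrics.
asap = evaluate_metric(p_value=0.40, degradation=0.001, max_degradation=0.01)
dat = evaluate_metric(p_value=0.01, degradation=0.020, max_degradation=0.01)
print(asap.name, dat.name, step_passes([asap, dat]))
```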
Isolated steps for independent causes
Once we set the criteria and aligned with all stakeholders, we identified which components of the migration can perturb the features and thus the inferences. In our case, migrating the models to the new server and migrating the client call to a new client service each imposed a migration regression risk. We therefore wanted to carry them out separately.
First, migrate the models to the new generation server from the old generation server
The models would be served on Sibyl, DoorDash’s newest generation online prediction service. In particular, our models are built with Python for fast development, but Sibyl stores the serialized version of the models and evaluates them with C++ for efficiency. We worked closely with the ML Platform team to ensure consistency between the Python and C++ model predictions, especially when there were custom wrappers on top of the out-of-the-box models.
There is a benefit to migrating the server first. Due to the settings of other components, we were able to force both the old and new servers to use the same sets of features. As a result, we were able to verify consistency by simply comparing every pair of inferences made by the two servers. As long as the inferences were the same, the assignment decisions would be the same, and we would not have to worry about a regression.
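A pairwise comparison of this kind might look like the sketch below. The data layout (a mapping from a request ID to a predicted value) and the function name are hypothetical, and the small relative tolerance is our own assumption to allow for floating-point differences between two runtimes. Set it to zero to require exact equality.

```python
import math
from typing import Dict, List


def find_mismatches(
    old_inferences: Dict[str, float],
    new_inferences: Dict[str, float],
    rel_tol: float = 1e-6,
) -> List[str]:
    """Return request IDs whose inference is missing or differs on the new server."""
    mismatches = []
    for request_id, old_value in old_inferences.items():
        new_value = new_inferences.get(request_id)
        if new_value is None or not math.isclose(old_value, new_value, rel_tol=rel_tol):
            mismatches.append(request_id)
    return mismatches


# Hypothetical time estimates, in seconds, keyed by request ID.
old = {"req-1": 612.0, "req-2": 845.5, "req-3": 300.0}
new = {"req-1": 612.0, "req-2": 851.0}
print(find_mismatches(old, new))
```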
Second, have the assignment client request the inferences by itself instead of by the old monolith client
The legacy monolith client requested the inferences whenever events of interest happened, while the assignment client runs a cron job that requests the inferences periodically. The earliest time that the assignment client can request the inferences is the next run time after the events happen. The average time difference is about 10 seconds, which is enough to shift many features and, in turn, the inferences. The different inferences can skew the assignment decisions, thus degrading our assignment quality and causing a regression.
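The delay introduced by polling can be reasoned about with a little arithmetic: an event has to wait for the next scheduled run, so with evenly spread events the average wait is about half the period. The sketch below illustrates this with a hypothetical 20-second period. The text does not state the actual schedule, so treat the number as an assumption.

```python
import math


def next_cron_run(event_time_s: float, period_s: float) -> float:
    """Earliest scheduled run at or after the event, for runs at 0, period, 2*period, ..."""
    return math.ceil(event_time_s / period_s) * period_s


def request_delay(event_time_s: float, period_s: float) -> float:
    """How long an event waits before the periodic job requests inferences."""
    return next_cron_run(event_time_s, period_s) - event_time_s


PERIOD_S = 20.0  # hypothetical cron period, not taken from the actual system

# Events spread evenly across one period: 0.5 s, 1.5 s, ..., 19.5 s.
event_times = [second + 0.5 for second in range(20)]
average_delay = sum(request_delay(t, PERIOD_S) for t in event_times) / len(event_times)
print(average_delay)
```

During that wait, fast-moving features keep changing, so the inference is computed from a later snapshot than the one the legacy event-driven request would have seen.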
Two example features that can be moved around by the time difference are the number of orders the merchant is currently working on, and the number of Dashers who are actively Dashing. For the first feature, the merchant can get new orders or finish existing ones at any time. This makes the number of orders fluctuate often. The second feature, the number of Dashers, can fluctuate a lot because Dashers may decide to start Dashing or finish for the day at any time.
This stage imposed a large regression risk. The assignment client utilized inferences from multiple models at different places for different purposes. If all places at once had used inferences requested by the assignment client itself instead of those done by the legacy monolith client, it would have been very hard to identify the root cause had the migration failed the switchback experiment.
To reduce the regression risk, we only swapped the “source” of inferences of one model at a time. That is, for one model at a time, the assignment client switched from using the inferences passed on by the legacy monolith client to using the inferences it requested itself. We ran a switchback experiment for each swap: the control group used the inferences for the model of interest requested by the old monolith client, and the treatment group used the ones requested by the assignment client.
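One way to picture the per-model swap is as a routing rule that only changes the source for the single model under test. This is a hypothetical sketch: the model names, constants, and function are invented for illustration and do not reflect the actual assignment client code.

```python
from typing import AbstractSet

LEGACY_SOURCE = "monolith_client"
NEW_SOURCE = "assignment_client"


def inference_source(
    model_name: str,
    experiment_group: str,
    model_under_test: str,
    already_swapped: AbstractSet[str],
) -> str:
    """Decide which client's inferences to use for one model.

    Models whose swap already passed always use the new source. The model
    under test uses it only in the treatment group. All others stay on legacy.
    """
    if model_name in already_swapped:
        return NEW_SOURCE
    if model_name == model_under_test and experiment_group == "treatment":
        return NEW_SOURCE
    return LEGACY_SOURCE


# Hypothetical state: model_a has passed, model_b is being tested.
swapped = {"model_a"}
for model in ("model_a", "model_b", "model_c"):
    for group in ("control", "treatment"):
        print(model, group, inference_source(model, group, "model_b", swapped))
```

Keeping every other model's source fixed means any difference between control and treatment can be attributed to the one model being swapped.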
The switchback experiments helped us detect that two swaps had led to performance degradations. Their respective first rounds of switchback experiments showed statistically significant differences in ASAP and DAT, our success metrics. And the degradations were worse than our secondary criterion. The good news was that, because each swap only involved one model and its related features, we were able to quickly find the root cause and design corrective measures.
For each of the two swaps, we re-ran the switchback experiment for the same swap after corrective measures had been applied. Everything turned out to be okay, and we moved to the next swap. Once all switchback experiment tests passed, all the inferences used by the assignment client were requested by the assignment client itself and our migration finished successfully.
Many of the ML models we use have complicated underlying algorithms and a large number of features. During a migration, these two characteristics together often result in a regression that is very difficult to prevent, detect, and correct. To ensure the business is not adversely affected by the migration, we developed two practices that can help mitigate the regression risk.
At the very beginning, we specify the metrics that the business is interested in, along with a threshold for each metric that can be quantifiably supported or refuted. While performing the migration, we isolate the risk factors, migrate each risky component sequentially, and validate the impact against the metrics immediately after each component is done. These two practices substantially increase the chance of detecting a regression at the earliest possible moment, and reduce the difficulty of root causing the degradation.
Our business end goal is the assignment quality. All stakeholders agreed on two metrics and their acceptable thresholds to ensure that the migration would not degrade the assignment quality, ensuring timely delivery of our customers’ orders. By isolating the risk factors and testing the metrics after each stage, we made sure that we spotted and amended any degradation as early as possible. In the end, we reached our business objective and metrics fairly quickly.
Many thanks to Param Reddy, Josh Wien, Arbaz Khan, Saba Khalilnaji, Raghav Ramesh, Richard Huang, Sifeng Lin, Longsheng Sun, Joe Harkman, Dawn Lu, Jianzhe Luo, Santhosh Hari, and Jared Bauman.