As companies use data to optimize and personalize customer experiences, it becomes increasingly important to implement services that can run machine learning models on massive amounts of data and quickly generate predictions at scale. At DoorDash, our platform uses data to power models that curate search results, assign dashers, recognize fraud, and more. This is only possible with a robust prediction service that can apply our models to the data and serve the various microservices that rely on data-driven insights.
We recently implemented DoorDash’s next-generation prediction service and named it the Sibyl prediction service, after the Greek oracles who were known to utter predictions in an ecstatic frenzy. In this blog post you will learn about the ideation, implementation, and rollout of Sibyl, and the steps we took to build a prediction service that can handle massive numbers of calls per second without breaking a sweat. If you are interested in how models are integrated with the service, you can check out this article here. While Sibyl itself may be unique to DoorDash, the learnings and considerations we used to build it can be applied to almost any high-throughput, low-latency service. If you are looking to learn about the greater ML platform, check out this blog post: DoorDash’s ML Platform – The Beginning.
Ideation and Requirements: The prediction service’s role in machine learning (ML) infrastructure
Before starting any actual coding, we took some time to consider exactly what role Sibyl would have in DoorDash’s ML infrastructure ecosystem, as well as the functionality this new prediction service would need. In terms of its role, we wanted the Sibyl prediction service to handle all real-time predictions and to focus on predicting, leaving other components, such as feature calculation, model training pipelines, and storage of features and models, in separate services and stores. The first thing we considered was scalability and latency: we were expecting hundreds of thousands of predictions per second, and to convince other services to call Sibyl, predictions had to be fast enough that calling Sibyl would beat the alternative of each service making predictions itself.
How Sibyl prediction service fits into the ML infrastructure ecosystem:
Figure 1: Sibyl prediction service high level flow
As Figure 1 shows, all predictions come in from other services as gRPC requests, and Sibyl retrieves both models and features from independent stores (with the option of outputting predictions to Snowflake for offline evaluation). A possible future expansion is a separate model evaluator, which could handle the pure prediction computation needed for complex models. For V1, however, this is included within Sibyl.
In terms of required functionality, here are some highlights:
Batch predictions: Each prediction request should be able to contain a variable number of feature sets to predict on (a “feature set” is a set of feature values, or simply one input we want a prediction for). Batch predictions are essential, as they allow client services to send 1 to N inputs and retrieve their predictions in a single call, greatly reducing the number of calls needed.
Shadow predictions: In addition to making predictions and sending them back to client services, the option to make shadow predictions asynchronously in the background was essential. Often, before settling on one particular model, teams have multiple candidate models and want to test them at once on the same data. Letting teams use one model for official predictions while asynchronously making predictions on the same data with other models gives them the flexibility to analyze the efficacy of the various candidates.
Feature and model fetching: As mentioned above, Sibyl needs to be able to fetch both features and models from their respective stores. Features are fetched during each prediction. Models, on the other hand, are all fetched when the service starts up and then cached in memory, which saves time and computing power by eliminating the need to load them for each request.
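To make these requirements concrete, here is a minimal Python sketch of what a batch request with shadow models and a startup model cache could look like. It is illustrative only: Sibyl itself is written in Kotlin and receives gRPC requests, and the names `PredictionRequest`, `ModelCache`, and `fake_model_store`, as well as the toy scoring functions, are assumptions made for this example rather than anything from the real service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

FeatureSet = Dict[str, float]            # one input we want a prediction for
Model = Callable[[FeatureSet], float]    # hypothetical model interface


@dataclass
class PredictionRequest:
    model_id: str                                   # model used for official predictions
    feature_sets: List[FeatureSet]                  # 1 to N inputs in a single call (batch)
    shadow_model_ids: List[str] = field(default_factory=list)  # candidates scored in the background


class ModelCache:
    """Loads every model once at startup and serves lookups from memory."""

    def __init__(self, load_all: Callable[[], Dict[str, Model]]):
        self._models = dict(load_all())   # single fetch from the (hypothetical) model store

    def get(self, model_id: str) -> Model:
        return self._models[model_id]     # no model-store round trip per request


def fake_model_store() -> Dict[str, Model]:
    """Hypothetical stand-in for the model store: model id -> scoring function."""
    return {
        "ranker_v1": lambda fs: 2.0 * fs["rating"],
        "ranker_v2": lambda fs: 2.0 * fs["rating"] + 1.0,
    }


cache = ModelCache(fake_model_store)
request = PredictionRequest(
    model_id="ranker_v1",
    feature_sets=[{"rating": 4.0}, {"rating": 3.5}],
    shadow_model_ids=["ranker_v2"],
)
# One call yields one prediction per feature set.
predictions = [cache.get(request.model_id)(fs) for fs in request.feature_sets]
```

In this sketch the shadow model ids simply travel with the request; how shadow predictions are run without delaying the response is a separate concern.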
Implementation: General service overview and decision highlights
To get a general understanding of how the service works, as well as a brief overview of the moving parts of the service, here’s what the lifecycle of a typical request looks like:
Figure 2: Lifecycle of a typical request
Referencing Figure 2:
The request arrives.
For each model identified in the request, we grab both the model and its model config (which contains information such as the required features, default fallback values for features, and the model type) from an in-memory cache.
Then we iterate through the values provided in the request and identify any required feature values that are missing. We do this for all models and all feature sets at once, and store the values in a map for easy lookup.
For all missing features, we attempt to retrieve the values from the feature store, which is a Redis cache of feature values. If a value still cannot be found, we fall back to the default value provided in the model config.
Now that we have all features and all feature values required for predictions, we asynchronously make a prediction for each feature set. For each shadow model, we also launch an asynchronous coroutine, but don’t wait for its results before continuing.
With all predictions made, we construct a response protobuf object, populate it with the predictions, and return it to the client.
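The feature lookup order in the steps above (request value first, then the feature store, then the model config default) can be sketched in a few lines of Python. This is an assumption-laden illustration, not the service's code: `resolve_features` is a hypothetical helper, and a plain dictionary stands in for the Redis feature store.

```python
from typing import Dict, List


def resolve_features(
    required: List[str],
    provided: Dict[str, float],
    feature_store: Dict[str, float],   # stand-in for the Redis feature store
    defaults: Dict[str, float],        # default fallback values from the model config
) -> Dict[str, float]:
    """Fill each required feature from the request, then the store, then the default."""
    resolved: Dict[str, float] = {}
    for name in required:
        if name in provided:
            resolved[name] = provided[name]          # value sent with the request
        elif name in feature_store:
            resolved[name] = feature_store[name]     # value found in the feature store
        else:
            resolved[name] = defaults[name]          # last resort: config default
    return resolved


features = resolve_features(
    required=["rating", "distance_km", "prep_time_min"],
    provided={"rating": 4.5},
    feature_store={"distance_km": 2.0},
    defaults={"rating": 3.0, "distance_km": 5.0, "prep_time_min": 15.0},
)
```

Here `rating` comes from the request, `distance_km` from the store, and `prep_time_min` falls back to its default.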
Now I want to highlight some noteworthy decisions and details:
Optimizing prediction speed using native API calls
The two ML frameworks that Sibyl uses, LightGBM and PyTorch (if you are curious as to why we chose these two, give DoorDash’s ML Platform – The Beginning a quick read), provide APIs in a variety of programming languages. However, to optimize for speed, we decided to store models in their native format and make prediction calls to these frameworks in C++. Working in C++ allowed us to minimize the latency of the actual predictions. We used the Java Native Interface (JNI) so that the service, implemented in Kotlin, can make the LightGBM and PyTorch prediction calls, implemented in C++.
Coroutines, coroutines, and more coroutines
Due to the demanding scalability and latency requirements, one of our top priorities was to make sure that all predictions were conducted in parallel, and that while features were being fetched from the feature store, threads would be doing useful computation instead of just waiting. Thankfully, developing in Kotlin gave us much-needed control over threads via its built-in coroutine implementation. Kotlin’s coroutines aren’t tied to a specific thread and suspend themselves while waiting, meaning that they don’t hold the thread and it can work on something else in the meantime. While it is possible to implement similar behavior in Java using callbacks, creating and managing Kotlin coroutines is syntactically far cleaner than managing Java threads, which makes multithreaded development easy.
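The suspend-while-waiting behavior described here is not unique to Kotlin. As a rough analogy, the following Python `asyncio` sketch shows several feature fetches waiting concurrently on one thread, and a shadow prediction that is scheduled but not awaited before the official result is ready. The functions `fetch_feature`, `shadow_predict`, and `handle_request`, the simulated 0.1 second latency, and the toy scoring are all hypothetical; this is not how Sibyl's Kotlin code is written.

```python
import asyncio
import time
from typing import Dict, List, Tuple


async def fetch_feature(name: str) -> float:
    """Hypothetical feature-store lookup; the sleep simulates network latency."""
    await asyncio.sleep(0.1)      # the coroutine suspends here and frees the thread
    return float(len(name))       # dummy value for illustration


async def shadow_predict(features: Dict[str, float], log: List[float]) -> None:
    """Hypothetical shadow model whose result is only logged, never returned."""
    await asyncio.sleep(0.1)
    log.append(-sum(features.values()))


async def handle_request(names: List[str], log: List[float]) -> Tuple[float, "asyncio.Task[None]"]:
    # Launch every fetch at once; they all wait concurrently on a single thread.
    values = await asyncio.gather(*(fetch_feature(n) for n in names))
    features = dict(zip(names, values))
    # Schedule the shadow prediction without awaiting it.
    shadow_task = asyncio.create_task(shadow_predict(features, log))
    official = sum(features.values())   # stand-in for the official model
    return official, shadow_task


async def main() -> Tuple[float, float, bool, List[float]]:
    log: List[float] = []
    start = time.perf_counter()
    official, shadow_task = await handle_request(["a", "bb", "ccc", "dddd", "eeeee"], log)
    elapsed = time.perf_counter() - start
    shadow_pending = not shadow_task.done()   # the response is ready before the shadow finishes
    await shadow_task                          # awaited here only so the demo exits cleanly
    return official, elapsed, shadow_pending, log


official, elapsed, shadow_pending, shadow_log = asyncio.run(main())
```

Five sequential fetches would take about 0.5 seconds in this simulation; because the coroutines suspend while waiting, the whole request completes in roughly the time of a single fetch.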
Rolling Out: First load testing, then introducing into production
Conducting a test run
We decided to test Sibyl's prediction capabilities on one of DoorDash's most in-demand services. DoorDash’s search service has many responsibilities, one of which is calculating which restaurants to show you when you visit doordash.com. You may not realize it, but every single restaurant you see on DoorDash needs to be scored and ranked beforehand, with the score being used to personalize your experience, showing different restaurants in different positions for different people (Figure 3).
Figure 3: The search service’s ranking illustrated
Currently, the search service’s ranking logic lives within the service itself. So we decided that whenever the search service was about to rank a restaurant, it would also spawn an asynchronous thread that called Sibyl. Doing this let us not only verify that the prediction service works as intended, but also accurately load test it. Furthermore, the asynchronous calls ensured that calls to Sibyl would not slow down the search service’s endpoints.
Sibyl ended up handling over 100,000 predictions per second. This test run demonstrated that Sibyl was ready to handle the throughput required by our production services, and that services at DoorDash could start migrating their models to call it for predictions.
Toggling request batch size and other tuning to optimize the service
One configuration we experimented with was the batch size of each prediction request. Since there are potentially hundreds of stores that can appear on your store feed, the search service ranks hundreds of stores at once. We were curious to see how much faster each request would be if, instead of sending all stores to Sibyl at once, we split the request into smaller chunks, so that instead of predicting on ~1000 stores at once, Sibyl predicted on, for example, 50 stores in each of 20 separate requests. We found that the optimal chunk size was around 100-200 stores per request. Interestingly, smaller chunk sizes, such as 10 or 20 stores, actually made latency worse. There was therefore a middle ground: the number of stores per request mattered, but the service performed better when chunks were reasonably large. The hypothesis is that if chunks are too small, the number of requests increases substantially, resulting in request queuing and higher latencies.
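For illustration, splitting one large batch into fixed-size chunks on the client side might look like the Python sketch below. The `chunk` helper and the `store_id` field are hypothetical names for this example; the 100-store chunk size is taken from the lower end of the range reported above.

```python
from typing import Dict, List

FeatureSet = Dict[str, float]


def chunk(feature_sets: List[FeatureSet], chunk_size: int) -> List[List[FeatureSet]]:
    """Split one large batch into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        feature_sets[i:i + chunk_size]
        for i in range(0, len(feature_sets), chunk_size)
    ]


# ~1000 stores to rank, each represented by a tiny hypothetical feature set.
stores = [{"store_id": float(i)} for i in range(1000)]

# Each chunk would become one prediction request.
requests = chunk(stores, 100)
```

With 1000 stores and a chunk size of 100, this produces 10 requests; a chunk size that does not divide the total evenly simply leaves a smaller final chunk.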
At the other extreme, if a single request contains all 1000 stores, the amount of data to send and receive balloons, and the time spent transferring it between client and service becomes the bottleneck. This finding was encouraging, as it demonstrated that Sibyl runs predictions efficiently in parallel, and that at large scale the service can make substantial batch predictions without a hiccup.
Besides chunking, we also looked into request compression. As mentioned above, one issue with these batch requests is that the sizable amount of data being sent results in long transfer times. With hundreds of stores and their feature values included in each request, it made sense to try compressing requests to reduce the number of network packets that need to be sent to Sibyl.
Finally using the service’s predictions in production
During load testing, although Sibyl was called every single time a store was ranked, the result returned by the service was never actually used. The next step was to use these calculated values and officially integrate the service’s predictions into the production workflow for various models. While handling requests from our search service was good for load testing, its very strict latency requirements meant that its models would not be among the first migrated over. Furthermore, additional time would be needed to migrate all feature values from the search service to Sibyl’s feature store. We decided to start with some fraud and dasher pay ML models, because their estimated QPS was far lower and their latency requirements were not as strict: fraud detection and dasher pay do not need to be calculated nearly as quickly as the home page needs to load. Starting in March of this year, both the fraud and dasher pay models officially use Sibyl for predictions.
Following the rollout, one big win we observed was a 3x drop in latency (versus our old prediction service), a testament to Sibyl’s efficacy.
Concluding rolling out: the end of the beginning
Following the successful rollout of the fraud and dasher pay models, the ML platform team has spent the past couple of months continuously adding more models to the service, and the migration of models to Sibyl is nearing completion. All but five models have been migrated and now call Sibyl for predictions. To learn more about their migration, check out this new blog post here. The team is continuing to add support for different feature and model types. For example, support for embedded features, which are used primarily in neural networks, has been added. Composite models, which consist of a chain of submodels and expressions called compute nodes, have also been added. Although Sibyl’s role as the predictor for DoorDash has just begun, it has already been an exciting and active one!
Thanks to Param Reddy, Jimmy Zhou, Arbaz Khan, and Eric Gu for your involvement in the development of the Sibyl prediction service.