DoorDash’s Geo team built a delivery simulator to automate a formerly manual process of testing new location-based logic on our platform. This new tool, which lets us simulate the events that take place during a real-world delivery, frees up considerable engineering resources, helping DoorDash scale to meet new challenges. Our Geo team is responsible for collecting and integrating the massive amount of geospatial data that empowers our delivery services. When implementing new logic, this team previously created and ran test deliveries through our production flows, a cumbersome manual process that did not allow for quick iteration.The delivery simulator build required choosing a design that would most accurately reflect real-world deliveries. Its architecture produces a realistic flow, from when a simulated Dasher receives a delivery assignment, through the routes calculated by the DoorDash platform, and, finally, delivery to the simulated customer.Automating tools such as these enables crucial velocity for teams at companies in a hyper-growth phase that must scale accordingly.
Manually testing the Geo service
When we introduce new location-based logic, such as changing the radius of our geofence around merchants, to DoorDash’s Geo service, our team runs end-to-end tests for a delivery to ensure the quality of our deployment. Previously, the only way to run these tests involved manually creating a test delivery through the DoorDash app, and using DoorDash’s internal tools and the Dasher app to simulate the delivery flow, as shown in Figure 1, below:
Figure 1: When testing new location logic, our Geo team used a manual process of creating a test delivery, then running it through our delivery flow.
Using the DoorDash app to simulate a delivery is not the best approach, for multiple reasons:
Manual testing makes innovation harder and slower. Creating, assigning, and completing deliveries is a painfully slow process. For just one test, we had to undergo the whole process shown in Figure 1. This limits the number of iterations we can make on our service, since testing is a bottleneck for deployment.
Manual testing is unreliable. Manually mocking real-time location updates for a test delivery is nearly impossible, unless we could physically walk or drive from the store to the consumer. Also, we cannot easily test different store or consumer addresses, since changing those would also require a lengthy manual process. This makes testing for nuanced or localized situations difficult.
Faced with these challenges, we decided to automate this process by building a location-based delivery simulator.
Choosing the best simulator design archetype
Before implementing the simulator, we had to carefully choose a structure and design archetype that would most effectively fit our needs. The literature on simulation is vast, but we considered three main types: Discrete-event simulation: Discrete-event simulation models a distinct series of events over time. The state of the system changes only when an event occurs. The benefits of discrete-event simulation are that it is simple and fast to implement, and real data points, such as location updates, are always collected in the discrete form. The drawback of this model is that it approximates continuous systems, such as traffic and transportation, yielding a less realistic and less accurate model. Continuous simulation: In continuous simulation, the state of the system is constantly changing. This type of simulation can more accurately replicate continuous phenomena, but is more complex to implement. In particular, it requires the use of differential equations to build. Agent-based modeling: This type of model is used to simulate behavior and interactions between individuals, or agents, within a system. It provides the most complexity, since each agent can have variations in their behavior or traits. For example, it can replicate a driver who tends to drive fast or slow, or one who often veers off route. Due to its complexity, it is more time-consuming to implement. It also has the most relevance when there are many entities being observed.
Table of simulator design archetypes
How it works
Captures distinct events over time
Simple and fast to implement
Real data is always collected in discrete form
Produces a less realistic and less accurate model
Presents a scenario that is constantly changing
Can more accurately replicate continuous events
Complex to implement, requires the use of differential equations
Simulates behaviors between different individuals
Provides more nuance and variation in individual behavior
Is one of the hardest to implement
After careful consideration, we chose to use discrete-event simulation with fixed-increment time progression. Fixed-increment time progression means that the simulation clock advances every x seconds, and the simulation state is updated at the end of that interval. This is as opposed to next-event time progression, which means that the simulation clock jumps to the time of the next event. We went with discrete-event simulation for a few reasons;
First, it is quick to implement. Although the simplest implementation, it is extensible: we can later increase the complexity with better location, routing, or traffic models, or we could even integrate agent-based modeling on top of discrete-event simulation if we wanted to observe the behavior of a fleet of Dashers.
Second, discrete-event simulation yields itself nicely to our service architecture. Our Geo service collects batches of location data from Dashers every few seconds through Apache Kafka. Fixed-increment time progression accurately replicates this interaction, since it allows us to update the state, and thus send location updates, at the end of a set interval.
Finally, since our service only observes data in discrete intervals and is agnostic to what happens in between those intervals, a lot of the complexity and power granted by the other simulation types would be lost.
Architecting our implementation of the simulator
Now that we had figured out what model we wanted to use, the next step was to plan where the simulator fit into our architecture. We would need to integrate it with existing parts of the DoorDash platform while devising its own routing and location update models for it to deliver accurate results.
The high-level overview of the simulation architecture
Figure 2: Our simulator takes inputs from DoorDash platform components and our test data, delivering information that lets us verify any new location logic.
As shown in Figure 2, above, the simulator takes the test Dasher, merchant, and consumer information, the original location of the Dasher, and the desired delivery status (one of picking up, waiting at store, or dropping off) as its input. The simulator uses this input to create a test delivery and assigns it to the Dasher through the endpoints in the DoorDash platform. It stores this information in the simulator state, then uses the Google Maps Directions API to build its routing and location update models.After this initialization, the simulator is ready to run. During each cycle, the position of the Dasher is updated using the routing and location update models. This new location is used to update the simulator state, transitioning the delivery state if necessary. Using the new simulation state, the simulator publishes a location update to our Kafka cluster, reports any proximity or location-based logic that was triggered due to the update, and discloses any change to the delivery state that occurred in the previous interval. The simulation runs until either the user pauses the simulation or the delivery completes.
Designing the routing model for the Dasher’s movement on a delivery
The routing model is currently comprised of four parts:
The route from the Dasher’s origin to the merchant location
The Dasher’s movement around the merchant
The route from the merchant to the consumer
The Dasher’s movement around the consumer
The reason why we need to take the Dasher’s behavior around the merchant and consumer into account is that, historically, this movement is often erratic and unpredictable, causing uncertainty in how we handle these cases. So, having a way to model the movement of unusual delivery routes in the simulator allows us to test the resilience of our service in these edge cases.We pull the Dasher’s route to the merchant and consumer from the Google Maps Directions API and store the state coordinate, end coordinate, duration, and distance of each step into our model. For modeling the Dasher’s movement around the merchant and consumer, we insert a custom routing model. This could be a stagnant or randomized route, or one with small perturbations around the point of interest.
Building the location update model
The location update model is used to advance the Dasher along the route. For our first version, we used a naive location model that assumes no traffic, meaning that the Dasher moves at a constant speed along each step of the leg. This results in a path like that shown in Figure 3, below:
Figure 3: Our first version of the delivery simulator assumes a constant speed for the Dasher along each leg of the trip. A more sophisticated version, coming soon, will let us model traffic and other delays onto the route.
While this simplistic model serves our immediate needs for testing location and proximity-based events, in the future, we might want to test the resiliency of our service with more realistic conditions. This model can then be replaced with a more advanced, stochastic model that either randomizes the speed of the Dasher or uses historical data to influence driving behavior.
With the creation of this simulator, the Geo team can now bypass all of the manual work previously required to test our code and focus purely on improving the geospatial aspect of our service. Companies experiencing rapid growth need to automate testing solutions, as we did with this simulator, in order to scale their reliability. Implementing a simulation opens the door for more innovation, as it allows teams to iterate on a service while remaining flexible enough to reveal weak spots or improvements that can be made. It also yields more accuracy, reliability, and performance: an automated solution can be run as a health check on services, providing assurance that everything is working as expected, and can quickly send alerts if something goes awry.Although the solution outlined in this article is oriented towards a location-based service, the core design and architecture can be readily applied towards a wide range of applications. Discrete-event simulation is a flexible and extensible tool, whose design and complexity can be modified to fit many specific needs.Janice Lee joined DoorDash for our 2020 Summer engineering internship program. DoorDash engineering interns integrate with teams to learn collaboration and deployment skills not generally taught in the classroom.