For every growing company using an out-of-the-box search solution there comes a point when the corpus and query volume get so big that developing a system to understand user search intent is needed to consistently show relevant results.
We ran into a similar problem at DoorDash where, after we set up a basic “out-of-the-box” search engine, the team focused largely on reliability. While the search results looked reasonable on the surface, it was common that particular queries in select locations would lead to poor search experiences. At DoorDash’s scale, even having relatively rare negative search experiences would still negatively impact a lot of our users, so we had to improve.
To fix this search relevance problem, we revisited each part of our search flow to understand why it was not performing well. Our analysis focused on the quality of the search results that were produced when a subset of our most frequently searched queries over the previous month were entered. We chose this subset because, like most search systems, our query data is long-tailed, and these frequent queries constitute a substantial portion of our overall search volume, as shown in Figure 1, below:
Based on the findings from our analysis, we re-engineered our search pipeline with an initial focus on improving the quality of results the consumer sees when they search for a high-frequency query. Even with this limited scope, our improvements led to a statistically significant increase in the overall conversion rate. In other words, consumers were more likely to find the type of food they were looking for and place an order.
First, we will go over the shortcomings of the legacy system, and then we will talk about how we improved this system for a subset of the head queries.
Overview of our search pipeline
When a user comes to our site and writes text into the search field, that query goes through two steps of our search pipeline to return search results. The first step, recall, retrieves the store and item documents which are relevant to the consumer’s query and within a set distance from their location. The second step, ranking, reorders the documents to best meet a consumer’s needs.
A deeper dive into the recall step
In the recall step, we use Elasticsearch to retrieve the store and item documents which are within an orderable distance from the consumer, and best match the consumer’s query intent.
This process starts with our filtering process, which ensures that all stores are within an orderable distance of a consumer. This filtering process involves a complex interplay between the preferences of the store, the delivery fee, and the distance of the store to the consumer. These nuances deserve a separate discussion, and are left out of this article for the sake of brevity.
Next, our matching process involves making token and n-gram matches of the query against relevant text fields, such as title, description, and store tags, of the documents in our database. This matching approach was sufficient a few years ago because the size and complexity of our online selection were relatively small, but it is not sufficient anymore. DoorDash has grown to serve over 4,000 cities, and has expanded from restaurants to include grocery deliveries. To give a sense of how fast we have grown, we have added about 100 million new items to our index in the last year, roughly doubling the size of our entire item index.
The issues with query ambiguity at scale
The sharp increase in the size and complexity of our data has accentuated the problems related to query ambiguity. One such problem is that consumers start seeing confusing results which do not match the search query. For example, when consumers searched for “California rolls”, they got results that showed Mexican food restaurants, instead of sushi places, because terms like “California” and “Roll” occurred somewhere in those restaurants’ menus, as shown in Figure 2, below. The issue here is that the search engine treated the query as a bag of words, looking at each word individually rather than treating the query as a single concept.
Another problem was that the search engine had trouble discerning exact matches. For example, when we typed in “salsas”, four stores showed up in the results, as opposed to the 83 that would appear if the query were just “salsa”, as shown in Figure 3, below. The search engine did not understand that these were the same query, with one just being the plural of the other.
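The bag-of-words failure can be reproduced in a few lines of Python. This is only an illustrative sketch: the two menus are made up, and the scoring functions `bag_of_words_score` and `phrase_match` are hypothetical simplifications rather than the scoring the search engine actually applies.

```python
import re


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words_score(query, text):
    """Count how many query tokens appear anywhere in the document."""
    doc_tokens = set(tokens(text))
    return sum(1 for tok in tokens(query) if tok in doc_tokens)


def phrase_match(query, text):
    """True only if the query tokens appear contiguously, in order."""
    q, d = tokens(query), tokens(text)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical menus, invented for this example.
taqueria_menu = "California burrito, carne asada fries, crispy taquito rolls"
sushi_menu = "California rolls, spicy tuna roll, miso soup"

for name, menu in [("taqueria", taqueria_menu), ("sushi bar", sushi_menu)]:
    print(name, bag_of_words_score("California rolls", menu),
          phrase_match("California rolls", menu))
```

Both menus get the same bag-of-words score because each contains the two words somewhere, while only the sushi menu contains the phrase as a unit.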
These inconsistencies can lead to frustration and a higher churn rate from our app, since the search results are not really capturing the consumer’s search intent.
How out-of-the-box search ranking struggled with concept searches
After the items and stores are collected in the recall step, we rank them according to the consumer’s needs. However, our search system had no dedicated component to rank the candidate stores returned from the recall step. Instead, the search engine used its original predefined relevance score, which relied on string similarities between the query and search entities, particularly store names.
As a result, the out-of-the-box search pipeline worked reasonably well when consumers searched for brands such as McDonald’s, Cheesecake Factory, or Wendy’s. However, it performed poorly when the query was a non-branded term for a broad business category, such as pizza, noodles, or Chinese.
Non-branded queries make up a large share of our searches, which made fixing this a priority. To fundamentally improve search relevance, we designed a dedicated precision step. This step leverages machine learning techniques and takes into account several kinds of search context information, going beyond simple string similarities.
Rethinking the search flow
From our analysis in the previous section, it became clear that there were opportunities for improvement on both the recall and the precision fronts. On the recall front, one of the key shortcomings of our legacy approach was that we were treating queries as a bag of words, and not attempting to understand the intent implied by the user. On the precision front, we didn’t have a sophisticated ranking model to incorporate information from the search context beyond the lexical similarities between the query and search entities.
We started to fill in the gaps in our understanding on both the recall and precision fronts, and built a more advanced, better-architected search system that modularizes the different but essential search components.
Building a base dataset to improve the search pipeline
We decided to focus our improvements on DoorDash’s 100 most popular queries from the previous month. We chose this set for our analysis because it accounts for a substantial percentage, greater than 20%, of our overall query volume. Additionally, it would be easier to begin with a narrow set since that would allow for rapid iteration before investing in a more scalable long-term solution.
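To make the idea of head-query coverage concrete, the sketch below computes what fraction of total query volume the top N queries account for. The function name `head_query_coverage` and the tiny query log are hypothetical, and the real analysis would run over a month of production logs rather than a list in memory.

```python
from collections import Counter


def head_query_coverage(query_log, top_n):
    """Fraction of all searches accounted for by the top_n most frequent queries."""
    counts = Counter(q.strip().lower() for q in query_log)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head_volume = sum(count for _, count in counts.most_common(top_n))
    return head_volume / total


# Made-up log: 10 searches over 4 distinct queries.
sample_log = ["pizza"] * 5 + ["sushi"] * 3 + ["kfc"] + ["pho"]
print(head_query_coverage(sample_log, top_n=2))
```

In a long-tailed distribution, this fraction grows quickly for small N and then flattens, which is why a small base set can still cover a substantial share of traffic.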
Rethinking our fixes on the recall and precision fronts
For the first iteration, we wanted to create the simplest possible search pipeline that could help fix our recall and precision problems. Accordingly, we made the following changes in the two steps of our search pipeline:
- On the recall front, we built a three-part pipeline which identifies, interprets, and elaborates upon any query within the base set. This pipeline would help test our hypothesis that the recall can improve if we treat queries as “things” and not “strings.”
- On the precision front, we developed a new ranking model that uses a pointwise learn-to-rank mechanism and includes search context information. This new ranking model would help us improve relevance ranking beyond lexical similarity alone, better fulfilling the user’s intent.
These changes formed the basis of our redesigned search pipeline, outlined in Figure 4, below:
Redesigning the recall step of our pipeline
Our design of the new recall pipeline had two main goals:
- Make search results more relevant for our consumers: The search results should reflect the consumer’s intent even when the consumer makes a minor spelling mistake.
- Make our search results more intuitive: When the consumer searches for a concept, they intuitively understand why they are seeing those results and don’t think the search engine misunderstood their query.
To accomplish these goals, we constructed a three-part pipeline, as detailed in Figure 4, above. The three steps are:
- Transform the query to a standardized form
- Understand the underlying concept of a standardized query
- Expand upon the concept that underlies the consumer’s intent
We will describe each of these steps in greater detail below.
We noticed that consumers often do not mention a concept in the base set directly, but refer to it using a colloquial synonym. For example, “Poulet Frit Kentucky” is how Canadians refer to KFC. Additionally, there are often minor spelling mistakes in search queries, such as writing KFZ instead of KFC. Despite these minor differences, it is clear that the consumer is referring to the same concept, KFC.
To ensure that these minor differences do not distort our pipeline’s understanding of the underlying concept, our first initiative was to remove noise from each query and convert it into a canonical, or standardized, form. Our standardization process involves performing some basic query pre-processing, followed by spell correction (using the Symmetric Delete spelling correction algorithm) and synonymization (using a manually created synonym map). For the examples mentioned above, “Poulet Frit Kentucky”, “KFZ”, and “KFC” would all get canonicalized to “kfc” in our new pipeline, as shown in Figure 5, below:
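The three standardization stages can be sketched as follows. This is a toy illustration, not the production pipeline: the synonym map, the vocabulary, the maximum edit distance of 1, and all function names are assumptions. The spell corrector follows the Symmetric Delete idea in miniature (index every vocabulary term under its character deletions, look up the deletions of the input, then verify candidates by edit distance).

```python
import re

# Hypothetical, manually created synonym map (phrase -> canonical concept).
SYNONYMS = {
    "poulet frit kentucky": "kfc",
    "kentucky fried chicken": "kfc",
}
# Hypothetical vocabulary: canonical names plus every token used in a synonym.
VOCABULARY = {"kfc"} | {tok for phrase in SYNONYMS for tok in phrase.split()}


def preprocess(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    query = re.sub(r"[^\w\s']", " ", query.lower())
    return re.sub(r"\s+", " ", query).strip()


def deletes(word):
    """The word itself plus every string formed by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def edit_distance(a, b):
    """Levenshtein distance, used to verify delete-based candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Precomputed index: deletion variant -> vocabulary terms that produce it.
DELETE_INDEX = {}
for term in VOCABULARY:
    for variant in deletes(term):
        DELETE_INDEX.setdefault(variant, set()).add(term)


def spell_correct(word):
    """Return the closest vocabulary term within edit distance 1, else the word."""
    if word in VOCABULARY:
        return word
    candidates = set()
    for variant in deletes(word):
        candidates |= DELETE_INDEX.get(variant, set())
    candidates = {c for c in candidates if edit_distance(word, c) <= 1}
    return min(candidates, key=lambda c: (edit_distance(word, c), c)) if candidates else word


def canonicalize(query):
    """Pre-process, spell-correct each token, then apply the synonym map."""
    corrected = " ".join(spell_correct(tok) for tok in preprocess(query).split())
    return SYNONYMS.get(corrected, corrected)


for raw in ["Poulet Frit Kentucky", "KFZ", "KFC"]:
    print(raw, "->", canonicalize(raw))
```

Indexing deletions ahead of time means a lookup only needs to generate deletions of the input, which keeps query-time work small even for a large vocabulary.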
Item names need not follow the English dictionary
For our initial tests, we had very relaxed parameters for the spell checker, and the query “Chick’n” was getting canonicalized to “chicken”. While this might seem like a reasonable canonicalization, we actually do have items with the term “Chick’n” in them, as shown in Figure 6, below. A consumer searching for “Chick’n” could actually be searching for a branded item named Chick’n rather than anything labeled chicken.
This form of ambiguity is common among the item names in DoorDash’s database of food items, and although our spell correction algorithm provided a reasonable correction in most cases, there was no way for us to be 100% sure that we had accurately identified the consumer’s intent. Furthermore, we currently do not have a “did you mean?” button on our platform, and therefore, if our canonicalization is incorrect, there is no way for the consumer to toggle back to the original request.
To avoid ambiguities such as these in our first pass, we made our spell correction criteria very stringent, only activating it when we found no matches between the query and any of the items in our corpus.
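The gating rule described above can be sketched as a guard in front of the spell corrector. Everything here is an assumption made for illustration: the item names and dictionary are invented, "match" is simplified to a substring test, and Python's `difflib.get_close_matches` stands in for the actual spell correction algorithm.

```python
import difflib

# Hypothetical corpus of item names and correction dictionary.
ITEM_NAMES = ["Chick'n Sandwich", "Fried Chicken", "Chicken Wings"]
DICTIONARY = ["chicken", "sandwich", "wings", "fried"]


def has_corpus_match(query, item_names):
    """Simplified match test: the query appears verbatim inside some item name."""
    q = query.lower()
    return any(q in name.lower() for name in item_names)


def maybe_correct(query, item_names, dictionary):
    """Spell-correct only when the raw query matches nothing in the corpus."""
    q = query.lower()
    if has_corpus_match(q, item_names):
        return q  # The query is a real item term, so leave it untouched.
    close = difflib.get_close_matches(q, dictionary, n=1, cutoff=0.8)
    return close[0] if close else q


print(maybe_correct("Chick'n", ITEM_NAMES, DICTIONARY))  # kept as typed
print(maybe_correct("chiken", ITEM_NAMES, DICTIONARY))   # corrected
```

The trade-off is deliberate: some genuine typos that happen to match an item go uncorrected, in exchange for never overriding a query that the corpus can already answer.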
Query understanding with concept identification and entity linking
Given the canonical form of the query, we want to:
- Use entity-linking to identify the concept in the base set mentioned by the user
- Create a knowledge graph traversal to derive similar concepts to the one being queried
Thus, when a consumer enters a search term such as “Kentucky Fried Chicken”, we know that:
- They are searching for a specific concept, in this case food from KFC
- There are related concepts they would potentially be interested in, like other merchants who make fried chicken
Identifying the concept
For our first version, we performed entity-linking by matching the canonical form of a query to the canonical form of a store name. This simplistic approach was sufficient for our use case because we were working with a small entity set, where problems seen at scale (like entity disambiguation) did not arise.
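With a small entity set, this kind of entity linking reduces to a dictionary lookup on canonical forms. The sketch below assumes a hypothetical `canonical` normalizer and a made-up list of store names; it is not the canonicalization used in production.

```python
import re


def canonical(text):
    """Hypothetical normalizer: lowercase and collapse non-alphanumerics to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Made-up entity set, indexed by canonical store name.
STORE_NAMES = ["KFC", "McDonald's", "IHOP"]
ENTITY_INDEX = {canonical(name): name for name in STORE_NAMES}


def link_entity(query):
    """Return the store whose canonical name equals the canonical query, if any."""
    return ENTITY_INDEX.get(canonical(query))


print(link_entity("kfc"))         # KFC
print(link_entity("mcdonald's"))  # McDonald's
print(link_entity("burgers"))     # None
```

An exact-match index like this cannot disambiguate between entities that share a canonical name, which is acceptable only while the entity set stays small.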
Identifying similar concepts
To identify similar concepts to the one described by the user, we manually created a knowledge graph, which captures the relationship between various concepts or entities within the greater food lexicon. In our knowledge graph, the vertices represent entities (for example, KFC), and the edges in the graph represent relationships between different entities (for example, KFC is a restaurant serving chicken). The entities in the knowledge graph are typically associated with several “types”. For instance, KFC’s “type” is “store”, and it also has a type labelled “Yum! Brands”, KFC’s parent company.
We created the first version of our knowledge graph with two main objectives:
- Cover all queries in our base set
- Leverage the pre-existing definitions in the DoorDash literature as much as possible
Accordingly, our knowledge graph contained three types of entities and three types of relationships, as described below.
The knowledge graph entities
The Store: As the name describes, this entity is where food is sold. Each store is associated with a primary cuisine, or store category in our terminology.
The Store Category: These are clusters of food concepts using a coarse-grained descriptor of the foods sold in a store, such as “Chicken”, “Chinese”, and “Fast Food”. Each category consists of one or more store tags, which describe the popular foods within each grouping.
The Store Tag: A fine-grained descriptor of popular items sold by restaurants on DoorDash’s platform. Examples of tags include “Fried Chicken”, “Dim Sum”, and “Tacos”.
The knowledge graph relationships
- Each store belongs to a single category called the primary category.
- Each tag belongs to exactly one category.
- Each category can have at most one parent category.
A handful of the top 100 queries did not fall under any of the three entity types described above (for example, McFlurry). To keep our approach as simple as possible, we did not include these queries in our base set.
A subset of our knowledge graph is shown in Figure 7, below. In our knowledge graph, the blue rectangles indicate stores, the red diamonds indicate the store categories, and the green ellipses indicate the store tags. In the DoorDash literature, we can have a sandwich category and a sandwich tag. Therefore, we have added suffixes to the entities of various types: “_biz” for businesses, “_cat” for categories, and “_tag” for tags.
For example, the store IHOP, annotated as “ihop_biz”, has the sandwich tag because it is associated with a primary category “breakfast_cat”, which in turn is the parent of the “sandwiches_cat” containing the “sandwiches_tag”.
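One way to picture this structure is as three lookup tables, one per relationship type. The sketch below is an assumption about representation only: the class name `KnowledgeGraph`, its methods, and the `pancakes_tag` entry are hypothetical, while the IHOP entities mirror the example above.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    store_category: dict = field(default_factory=dict)   # store -> primary category
    tag_category: dict = field(default_factory=dict)     # tag -> its single category
    category_parent: dict = field(default_factory=dict)  # category -> parent category

    def descendants(self, category):
        """The category itself plus every category beneath it."""
        found, frontier = {category}, [category]
        while frontier:
            current = frontier.pop()
            for child, parent in self.category_parent.items():
                if parent == current and child not in found:
                    found.add(child)
                    frontier.append(child)
        return found

    def tags_under(self, category):
        """All tags belonging to the category or any of its descendants."""
        categories = self.descendants(category)
        return {tag for tag, cat in self.tag_category.items() if cat in categories}

    def tags_for_store(self, store):
        """Tags reachable from a store through its primary category."""
        return self.tags_under(self.store_category[store])


graph = KnowledgeGraph(
    store_category={"ihop_biz": "breakfast_cat"},
    tag_category={
        "sandwiches_tag": "sandwiches_cat",
        "pancakes_tag": "breakfast_cat",  # hypothetical extra tag
    },
    category_parent={"sandwiches_cat": "breakfast_cat"},
)
print(sorted(graph.tags_for_store("ihop_biz")))
```

Storing each relationship as a single-valued mapping enforces the constraints directly: a store cannot have two primary categories, and a tag or category cannot have two parents.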
Expanding on the concept underlying the query
Once the entity underlying a consumer’s query and its related entities are known, we are no longer constrained by simplistic notions of string matching, like n-grams, to find relevant documents. Our strategy in creating the search query was to give the highest preference to the underlying entity, and to use the related entities as a fallback when sufficient results are not found.
When a consumer searches for a store such as KFC, our search query gives the highest preference to the store name, so that KFC is in position 1. The search then returns all stores having one or more of the tags (fried chicken, wings) belonging to the store’s primary category (Chicken), as shown in Figure 8, below:
When a consumer enters a query for a category, such as Asian food, our search service looks for all stores containing one or more tags that are descendants of the category in question, as shown in Figure 9, below. In a search for Asian food, Charm Thai Eatery shows up in the search results because it contains the Thai tag, which is in the Thai category, a descendant of the Asian category. The HI Peninsula restaurant shows up because it contains the Dim Sum tag, which is in the Chinese category, a descendant of the Asian category. La Petite Camille shows up because it contains the Vietnamese tag, which is in the Vietnamese category, a child of the Asian category.
When a consumer searches for a tag, such as sushi, we give the highest preference to stores containing the tag Sushi, and then search for all stores containing any of the tags belonging to the parent category, Japanese, as shown in Figure 10, below. Ramen is a tag under the Japanese category, so stores tagged Ramen would also show up in the results.
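The three expansion rules (store, category, and tag queries) can be sketched as one function that returns the terms to prefer and the terms to fall back on. The lookup tables, the `expand` function, and the `preferred`/`fallback` keys are all hypothetical; the sketch says nothing about how these terms are turned into an actual Elasticsearch query.

```python
# Hypothetical slice of the knowledge graph, using the entities named above.
STORE_CATEGORY = {"kfc_biz": "chicken_cat"}
TAG_CATEGORY = {
    "fried_chicken_tag": "chicken_cat",
    "wings_tag": "chicken_cat",
    "sushi_tag": "japanese_cat",
    "ramen_tag": "japanese_cat",
    "thai_tag": "thai_cat",
    "dim_sum_tag": "chinese_cat",
    "vietnamese_tag": "vietnamese_cat",
}
CATEGORY_PARENT = {
    "thai_cat": "asian_cat",
    "chinese_cat": "asian_cat",
    "vietnamese_cat": "asian_cat",
}
CATEGORIES = set(TAG_CATEGORY.values()) | set(CATEGORY_PARENT) | set(CATEGORY_PARENT.values())


def descendant_categories(category):
    """The category itself plus every category beneath it."""
    found, frontier = {category}, [category]
    while frontier:
        current = frontier.pop()
        for child, parent in CATEGORY_PARENT.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def tags_in(categories):
    """Sorted tags whose category is in the given set."""
    return sorted(tag for tag, cat in TAG_CATEGORY.items() if cat in categories)


def expand(entity):
    """Return the terms to prefer and the terms to use as a fallback."""
    if entity in STORE_CATEGORY:  # store query: the store first, then its category's tags
        return {"preferred": [entity], "fallback": tags_in({STORE_CATEGORY[entity]})}
    if entity in TAG_CATEGORY:  # tag query: the tag first, then sibling tags
        siblings = [t for t in tags_in({TAG_CATEGORY[entity]}) if t != entity]
        return {"preferred": [entity], "fallback": siblings}
    if entity in CATEGORIES:  # category query: all tags under the category tree
        return {"preferred": tags_in(descendant_categories(entity)), "fallback": []}
    return {"preferred": [], "fallback": []}  # entity outside the base set


for entity in ["kfc_biz", "asian_cat", "sushi_tag"]:
    print(entity, expand(entity))
```

Keeping the preferred and fallback terms separate lets the downstream query boost exact-intent matches while still filling the results page when those matches are scarce.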
Redesigning the ranking step of our pipeline
Making improvements to the ranking portion of our search pipeline was more difficult than in the recall portion for a couple of reasons:
- There is a strong dependency on the preceding recall step, and it is hard to develop both at the same time. In particular, with a machine learning (ML)-driven approach to ranking, a model trained on the dataset generated by the old recall logic does not generalize well to ranking the new list of candidate stores.
- When we were developing this solution, our search service was not integrated with Sibyl, DoorDash’s real-time prediction service. Therefore, we were greatly limited in the ML ranking models we could support from an infrastructure perspective. We decided to address this problem on two fronts. On the ranking front, we trained a basic logistic regression model with store and lexical-based features with the goal of collecting training data for our eventual learn-to-rank model. On the infrastructure front, the team was actively working with the ML platform team to integrate Sibyl with our search backend to empower ML solutions for relevance ranking in the near future.
- Because our search service was not yet integrated with the prediction service, there were limited ML opportunities for search ranking from an infrastructure perspective at the time. Instead, we decided to move forward with a simple heuristic ranker, which takes into account lexical similarity and store popularity. In this way, we could quickly roll out the entire search stack in production for testing and for collecting data to train a learn-to-rank model.
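A heuristic ranker that blends lexical similarity with store popularity could look roughly like the sketch below. The linear blend, the weight of 0.7, the use of `difflib.SequenceMatcher` as the similarity measure, and the popularity values (assumed to be pre-normalized to the range 0 to 1) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def lexical_similarity(query, store_name):
    """String similarity between the query and the store name, in [0, 1]."""
    return SequenceMatcher(None, query.lower(), store_name.lower()).ratio()


def heuristic_score(query, store_name, popularity, lexical_weight=0.7):
    """Weighted blend of lexical similarity and normalized store popularity."""
    lexical = lexical_similarity(query, store_name)
    return lexical_weight * lexical + (1 - lexical_weight) * popularity


def rank(query, candidates, lexical_weight=0.7):
    """Sort (store_name, popularity) pairs by descending heuristic score."""
    return sorted(
        candidates,
        key=lambda c: heuristic_score(query, c[0], c[1], lexical_weight),
        reverse=True,
    )


# Made-up candidate stores with made-up popularity values.
candidates = [("Burger Barn", 0.9), ("Pizza Palace", 0.2), ("Pizza Plaza", 0.6)]
print(rank("pizza", candidates))
```

A fixed-weight blend like this has no learned parameters, so it can ship without a model-serving dependency while the logged impressions and orders accumulate as training data.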
We ran an experiment with three buckets: control, new recall plus current ranker, and current recall plus new ranker.
Comparing the new ranker with our current ranker, we did not see any statistically significant improvement in the conversion rate, that is, how many consumers placed orders based on their search results. This suggests that data staleness alone was not the reason the current ranker was underperforming. As of today, we have used these learnings to set up the first version of our face-lifted ranker in Sibyl and are currently in the process of experimentation.
The recall portion validated our hypothesis that treating search queries as “things” and not “strings” dramatically improves search performance, a result that has become conventional wisdom in the scientific community. Despite oversimplifying every step in the pipeline, we saw a 9% improvement in click-through rate, a 10% improvement in conversion rate, and a 76% reduction in null rate (the share of search queries that return no results) for our overall store queries at DoorDash. This translates to a statistically significant increase in the overall conversion rate. Motivated by our results here, we have been working to expand our recall section even more, and plan to share more progress on this effort soon.
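For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute them from search logs. The `SearchEvent` record and the per-search denominators are assumptions; the exact definitions used in the experiment are not spelled out here, and the sample events are invented.

```python
from dataclasses import dataclass


@dataclass
class SearchEvent:
    results_shown: int  # number of results returned for the search
    clicked: bool       # the consumer clicked at least one result
    ordered: bool       # the consumer placed an order after the search


def search_metrics(events):
    """Click-through, conversion, and null rates as fractions of all searches."""
    total = len(events)
    if total == 0:
        return {"click_through_rate": 0.0, "conversion_rate": 0.0, "null_rate": 0.0}
    return {
        "click_through_rate": sum(e.clicked for e in events) / total,
        "conversion_rate": sum(e.ordered for e in events) / total,
        "null_rate": sum(e.results_shown == 0 for e in events) / total,
    }


# Made-up events for illustration only.
events = [
    SearchEvent(results_shown=5, clicked=True, ordered=True),
    SearchEvent(results_shown=3, clicked=True, ordered=False),
    SearchEvent(results_shown=0, clicked=False, ordered=False),
    SearchEvent(results_shown=2, clicked=False, ordered=False),
]
print(search_metrics(events))
```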
Upgrading search to better understand the query intent is a fairly common problem for growing digital commerce companies. We show here that, oftentimes, there is a lot of room for improvement even in the head queries, especially if search has not progressed from what was implemented out-of-the-box. An added benefit to improving the head queries is that the infrastructure and tooling needed to improve the head queries are identical to those needed to improve the long tail. After implementing this pipeline, our next step is to continually refine our approach while expanding our base set to include more of the long tail.