As machine learning (ML) becomes increasingly important across tech companies, feature engineering becomes a bigger focus for improving the predictive power of models. In fact, data scientists and ML engineers now spend 70% of their time on feature and data engineering.
At DoorDash, most of our ML applications leverage our real-time prediction service (Sibyl), which is supported by a Redis- based feature store. But while our serving stack is now fairly mature, developing new features and iterating on existing features still form a major chunk of a data scientist’s development lifecycle.
So, to accelerate development of E2E pipelines for feature engineering and serving, we built Fabricator as a centralized and declarative framework to define feature workflows. Within the past few months, data scientists have leveraged Fabricator to add more than 100 pipelines generating 500 unique features and 100+B daily feature values. In this blog, we visit some of the motivations, designs, and learnings around building the framework.
Reimagining feature engineering at DoorDash
Our legacy systems were functional but not ideal.
In our current ecosystem shown in Figure 1, most data science verticals
- Author and manage the feature generation ETLs written on top of our internal data engineering infrastructure using Snowflake and Airflow.
- Handwrite hundreds of lines of SQL joins to produce the final training datasets.
- Collaborate with the ML platform to productionalize features for online feature and model serving.
Why is this legacy workflow proving harder to scale?
Currently E2E feature engineering is an amalgamation of loosely coupled systems. This introduces the following problems:
- Too many touch points: Data scientists work with the data infrastructure to maintain their ETLs, hand-write their offline feature serving code, and work with the ML platform to author their online models. This makes model development velocity slow. Additionally, the overhead of understanding each system and its technologies closely (Snowflake, Airflow, Redis, Spark, the list goes on) makes onboarding really difficult.
- Infrastructure evolution is slow: Given the loosely coupled nature of systems with different charters, the evolution of the infrastructure is difficult. Small changes can break the entire flow. And, as our data grows with the company, it makes our systems slower and/or costlier. The entire framework needs more iterability to keep the cost curves flatter.
- We have no management UI for features: Data scientists have no access to explore hundreds of features made by other members of their org, share them across teams, or observe their lifecycle. Each feature has to be code searched across repos to access the full picture. Additionally, given that feature observability and monitoring is an ad hoc process we do not have the ability to understand drift and quality of features over time. Lack of these two abilities further reduces the development velocity for new features.
What does an ideal feature platform look like?
We embraced two major Doordash philosophies to design a more appropriate feature engineering platform:
- Build products not systems: The value is from having feature engineering be a centralized product, with a single touch point that enables every component.
- Make it easy to do the right thing: Customizations are always needed, but 80% use cases fall into a set of common templates. We wanted to make it extremely easy to do the defaults, leaving room for advanced users to add customizations with marginally more effort.
Subscribe for weekly updates
Considering the components we wanted to provide to our data scientists, we drew inspiration from really good open source solutions such as Feast and in-house solutions such as Palette from Uber and Zipline from Airbnb for the high- level design shown below in Figure 2.
Our vision for an ideal feature platform was formed from the following components:
- Data plane:
- Feature generation: A guided low-friction process to develop new features.
- Offline serving: An abstract API layer to fetch features from their underlying storage to enrich training datasets.
- Online serving : A seamless workflow to materialize produced features to an online feature store as soon as data is ready.
- Control plane:
- Feature management and discovery: An accessible UX to track and maintain the current catalog of features, their definitions, and their lifecycle status.
- Feature observability: An integrated solution to track statistics about usage and drift on features in production.
How Fabricator delivers a centralized and declarative framework
Fabricator delivers the above design using three key components:
- A central declarative feature registry that allows users to define their entire workflow all the way from generation to serving.
- A unified execution environment that provides a set of high-level APIs to freely use multiple storage and compute solutions.
- Automated and continuously deployed infrastructure, which reduces operational overhead to zero.
Creating a declarative feature registry
A big bottleneck in feature development is the boilerplate and domain specificity of the systems used to build its components. As such, the machine learning operations (MLOps) community has been gravitating toward templated YAML and SQL to manage various definition spaces at a higher level of abstraction for a while now.
We’ve had similar findings within DoorDash as well, from projects like Riviera. Fabricator provides a single YAML-based repository of definitions, which are backed by a protobuf schema and stored into a central feature registry.
Introducing sources, sinks, and features
Utilizing protocol buffers (Protobuf) as a schema definition layer for the definition objects allows our YAML files to be structured in backward- and forward-compatible ways. We achieve full descriptive power through three concepts: Sources, sinks, and features.
Sources: This YAML template describes a generative definition for a feature source. The flexibility of the proto schema allows us to use the same definition for real-time and batch features. A sample source definition is shown below:
source: name: consumer_engagement_features storage_spec: type: DATALAKE table_name: dimension_consumer_engagement_features compute_spec: type: PYSPARK spark_spec: file: … resource_overrides: … trigger_spec: upstreams: …
A few salient points about the schema:
- We declare the output storage format in the source definition. This is quite flexible and allows us to support multiple batches (S3 and Snowflake) and real-time outputs (Kafka).
- The compute spec is an extensible definition that allows separate customization based on compute type. Spark, Snowflake SQL or Flink SQL are currently supported.
- Additionally, a trigger spec allows us to further customize when this pipeline should be run. As we see later, this makes it easy to automate pipeline orchestration.
Sinks: Apart from their persistent storage, features may need to be materialized to a different store for online serving. We use sinks to define such materialization stores. We currently support Redis and Kafka as materialization stores.
sink: name: search-redis type: REDIS redis_spec: cluster_node: …
Features: The primary goal for the definitions is to identify the feature lifecycle. Features are identified by name and connected to their generative source as well as materialization sinks using the YAML definition as shown below:
feature: name: caf_cs_consumer_clicks source: consumer_engagement_features materialize_spec: sink: search-redis sample: 0.5
Deploying features continuously
A key blocker towards iterability for data scientists is having a really quick release cycle to their entire feature pipeline. Fabricator maintains a repository of all YAML definitions and updates the central registry as a part of every product release CI/CD cycle. This enables their changes to take effect within minutes of their commits.
Setting up a unified execution environment
Providing an easy-to-use definition language and enabling quick deployments sounds helpful, but it’s only part of the development process. As a data scientist, testing feature pipelines and then deploying those is equally important. Traditionally, playgrounds operate in separate environments from production, making translating development code to production DSLs a cumbersome process.
To solve this, Fabricator provides a library of APIs to provide a unified execution environment during development and production, that integrate natively with the DSL. These APIs reduce the boilerplate required to build complex pipelines and offline feature serving as well as provide efficient black-box optimizations to improve runtime performance.
Using contextual executions
Fabricator provides extensible Pythonic wrappers around the YAML DSL called Contexts, which can be specialized for more specific pipelines. An example base class and its specialization for executions is shown below:
class BaseContext: def __init__( self, table_name: str, storage_type: str, schema: typing.Optional[str] = None, env: str = "staging", indexes: typing.Optional[typing.List[str]] = None, name: str = "", ): …
class FeatureContext(BaseContext): def __init__( self, features, entities, table_name, storage_type="datalake", ): super().__init__(...)
Why is this simple wrapper important? Conceptually, this is quite straightforward, but behind the scenes we leverage it for three important reasons:
- Fabricator pipelines are authored to “run” a Context. Every YAML compute spec translates to an appropriate context and applies user code to it. This makes development and production work in the same way.
- Contexts hide infrastructure interactions, as we’ll discuss later. You can operate freely between multiple storage layers (Snowflake, S3, etc) and compute layers (Spark, Snowflake SQL or just simple Pandas) through these Context objects.
- Existing definitions can be easily referenced in an abstract way. FeatureContext.from_source(‘consumer_engagement_metrics’) would give you the fully formed Context for the YAML we defined in the previous section.
Enhancing with black box optimizations
As our compute and storage offerings expand, there are a range of technological nuances to master. These questions aren’t universally known across data science. With the information stored inside Contexts, we provide healthy defaults that make onboarding really smooth.
For example, when you’re writing a Spark data frame out to S3, there’s a few open optimization questions. How should the data be partitioned? Will partitions run through a single reducer? Is there a data skew or partition skew? Answers to some of these questions may decide if a pipeline runs in fifteen minutes or four hours. We provide APIs such as write_data_spark(context, df) that identify the number and type of partitions that are optimal for your data.
Enabling offline serving
Offline serving is the portion of the workflow that focuses on using generated features to create an enriched training or validation dataset. Fabricator provides a simple get_features API that allows the naming features that are pulled for a provided DataFrame. The API infers the related Contexts and storage information and constructs efficient joins to facilitate the work.
Since these joins are blackbox APIs as well, we can apply optimization techniques to all existing jobs in one fell swoop. We were able to accelerate multiple jobs by ~5x when we leveraged key-based repartitioning for feature joins.
Automating infrastructure integrations
The discussion so far has centered primarily around creating and using features offline. But feature engineering has a few other parts to its lifecycle. Fabricator’s framework and central registry enable automation in two major ways :
- We can automatically infer orchestration and online serving needs using the same definitions that are committed to the registry. Users get this for free.
- We can add additional integration points to Fabricator to other parts of our data platform, such as our central data portal or data quality tools and amplify the gains across hundreds of features.
Automating workflow orchestration
Fabricator can automatically construct and update DAGs and dependency management for our user’s pipelines. At DoorDash, we leverage Dagster to construct date-partitioned slices of user DAGs, which automatically infer dependencies from the definitions and can be used to concurrently backfill new pipelines as far as a year within a few hours.
Automating online serving
With a simple setting of materialize spec, one can set up their pipeline to materialize features to the online feature store. Features are typically uploaded to our feature store within minutes of the data being available in offline storage.
Making feature discovery self-serve
The central registry enables multiple central UX services to surface a catalog of features and their lineage. We leverage Amundsen internally to connect our features and their source table information to the rest of core data information to create a holistic data lineage graph.
Improving feature observability
We are now starting to leverage the YAML spec to also configure observability details such as thresholds for feature defaults in production or data quality rules for output feature data using frameworks like Great Expectations or Deequ.
We’ve come a long way since we first ideated this framework. Since its launch, Fabricator has helped double the number of feature pipelines supported within DoorDash. Throughout the development of this project we have been able to learn some key things and also make a large impact on the productivity of our data scientist which we will go into more detail on below.
- Leveraging centralized changes such as array native storage and Spark-based UDFs for embeddings helped scale many of our embeddings pipelines by more than 12x in running time.
- An automated orchestration layer has made the process of backfilling much less cumbersome. We have backfilled more than 70 new jobs for up to one year of data, accelerating experiment timelines by many days in some cases.
- Optimizations are multiplicative : If you leverage a storage optimization that has twice as high throughput, and couple it with a UDF-based compute that is six times faster, cumulative wins are 12x. Such black box optimizations have brought down cumulative running times for many of our jobs from > 120 cluster hours to a few hours per day.
- Standardization accelerates development: Once you package 80% of the standard use cases behind a few simple knobs, the easing of the decision process makes iterations significantly faster, even if the underlying process didn’t change significantly.
- Parallelizable backfills provide a big boost to velocity: Backfills are an under-rated problem in machine learning data engineering. Just because one year of data backfill may take up to a few days to set up, data scientists may choose a small subsample of data to iterate faster. Having that data in a few hours instead makes iteration velocities much easier.
Much of this blog has focused on the batch abilities of the framework, but in reality ML data engineering is quickly navigating into lambda architectures. Fabricator was designed to natively operate between real time and batch feature pipelines, as the YAML describes, and leveraging that into working with hybrid pipelines (batch features bootstrapped with real time incremental features) is the future path.
Fabricator is a result of a large amount of cross functional collaboration across various platform and data science teams. I would like to acknowledge Hien Luu, Brian Seo and Nachiket Mrugank Paranjape for contributing to the growth of the framework. I would also like to acknowledge Satya Boora and his team for integrating Fabricator with our Data Infrastructure, as well as Abhi Ramachandran, Yu Zhang and the rest of the Data Science team for partnering with us to test our work.