As an orchestration engine, Apache Airflow let us quickly build pipelines in our data infrastructure. However, as our business grew to 2 billion orders delivered, scalability became an issue. Our solution came from a new Airflow version which let us pair it with Kubenetes, ensuring that our data infrastructure could keep up with our business growth.
Data not only powers workflows in our infrastructure, such as sending an order from a customer to a restaurant, but also supplies models with fresh data and enables our Data Science and Business Intelligence teams to analyze and optimize our services. As we grew to cover more than 4,000 cities, all the data became more complex and voluminous, making orchestration hard to manage.
Managing data in our infrastructure to make it usable for the DoorDash teams who need it requires various pipelines and ETL jobs. Initially, we used Airflow for orchestration to build data pipelines and set up a single node to get started quickly. When scalability became an issue, we looked for another orchestration solution.
The open source community came to the rescue with a new Airflow version adding support for Kubernetes pod operators. This solution was perfect for our needs, as we already use Kubernetes clusters and the combination scaled to handle our traffic.
How Airflow helped orchestrate our initial data delivery
To contextualize our legacy system we will dive into how Apache Airflow was set up to orchestrate all the ETL’s that power DoorDash’s data platform. Apache Airflow’s platform helps programmatically create, schedule, and monitor complex workflows, as shown in Figure 1 below the DAG gets extremely complicated. It provides an easy-to-read UI and simple ways to manage dependencies in these workflows. Airflow also comes with built-in operators for frameworks like Apache Spark, Google Cloud's BigQuery, Apache Hive, Kubernetes, and AWS' EMR, which helps with various integrations. When we began using Airflow for scheduling our ETL jobs, we set it up to run in a single node cluster on an AWS EC2 machine using Local Executor. This configuration was easy to set up and provided us with the framework needed to run all our ETL jobs. While this was a great place to start, it did not scale to meet our business’ continued growth.
The challenges of scaling ETL pipelines
DoorDash has significant data infrastructure requirements to analyze its business functions. These analytics cover everything from product development to sales, and all depend on getting data from Snowflake, our data warehouse vendor. Our data infrastructure powers thousands of dashboards that fetch data from this system every five to ten minutes during the peak hours of usage.
These ETL pipelines, depicted in Figure 2, below, ensure that data for analysis is delivered accurately and in a timely manner. Many of our ETL jobs run in Airflow, and about a thousand of them are very critical, high-priority jobs affecting business-critical functions. In addition, the number of ETL jobs we host grows everyday, threatening the reliability of our data orchestration pipelines.
Scaling our Airflow setup
Our original Airflow setup ran on an AWS EC2 instance, making it a single point of failure, hampering our system’s scalability. If there was machine failure or other issues with the EC2 instance our only recourse would be to replicate the whole setup on a new machine, which was not an easy or reliable alternative. It was also difficult to scale the EC2 instance dynamically for resources, causing performance issues due to CPU and memory limitations. The lack of fault tolerance and the scalability issues in our Airflow setup were not acceptable, as they could lead to outages where data was not available and important business decisions and ML models would be disrupted. These outages could also impede some of our critical data analysis, slowing the team’s productivity.
We had to build a distributed setup to make Airflow more scalable. A running instance of Airflow has a number of daemons, including a web server, scheduler, and workers, that work together. In a multi-node Airflow setup there is a node which has a web server and/or scheduler, and a group of worker nodes. Setting up daemon processes that are distributed across all worker nodes and use one of the RemoteExecutors makes it very easy to scale the cluster horizontally if we can add more workers. Although it’s a common industry practice to use Celery, a distributed task queue, in a remote setup, we found an option to set up Airflow on Kubernetes. As our infrastructure ecosystem runs on Kubernetes, this solution made the most sense as this helps with standardizing deployments, access controls, and monitoring etc.
Using Kubernetes for Airflow
The Kubernetes Executor, introduced in Apache Airflow 1.10.0, can run all Airflow tasks on Kubernetes as separate Pods. The difference between Kubernetes executor and the KubernetesPodOperator is that KubernetesPodOperator can run a container as a task, and that container will be run inside a pod on a Kubernetes cluster. KubernetesPodOperator can be used with any type of executor, including Celery executors and Kubernetes executors.
At DoorDash, we have Airflow workflows ranging from ETL transformations to Data Science pipelines. These workflows might use different configurations that have different resource needs, or trigger tasks that run in different languages or different versions of the same language. For such varied use cases KubernetesPodOperator provides more flexibility compared to other executors as it will run a container which can encapsulate these varied configuration requirements.
The Kubernetes Operator uses the Kubernetes Python client to generate a request that is processed by the API server. Pods can be launched with defined specifications. It is very easy to specify what image has to be used, especially with options like environment variables and secrets. Logs can be gathered locally to the scheduler or to any distributed logging service used in their Kubernetes cluster. Using Kubernetes, Airflow users now have more power over their run-time environments, resources, and secrets, basically turning Airflow into a more dynamic workflow orchestrator.
How did we do this migration?
To get things onto Kubernetes we took the following steps:
- Containerized ETL code
- Migrated Airflow scheduler and web server to Kubernetes
Containerizing our ETL code
The first step to move Airflow onto Kubernetes was the process of containerizing all our ETL code. To be able to run our code on Kubernetes, we first had to create a Docker image. We used Puckel’s Airflow containerization image and customized it by adding our system files and packages. Credentials were then passed to the Kubernetes Pods using Kuberenetes secrets. With our ETLs containerized, we could easily run our code inside the Kubernetes Operators. So we just need to reconfigure our Airflow setup to use Kubernetes operators.
Migrating the Airflow scheduler and web server to Kubernetes
As discussed above, two main daemons for Airflow are the scheduler and web server. For better fault tolerance and isolation, we separated the scheduler and web server into two Pods. We used a docker-compose file to manage two containers. After the two containers spun up and Airflow DAGs ran successfully, we moved to the next step, deploying the containers to our Kubernetes cluster. For continuous deployment, we need to manage and store our Docker images. Because our cluster is set up on AWS, we used the Amazon Elastic Container Registry (ECR) for storing these images.
To implement our ETL jobs using Kubernetes Operator we initially decided to start with jobs that were showing issues related to resource constraints. These jobs did not have special environment requirements so we used a common base image and then installed any additional requirements for that job. We still run some small jobs which don’t require significant resources using a local executor, but eventually every job will run in a different Pod.
This new setup enables us to meet all our Data team’s requirements. For example, whenever we get a new requirement from one of our data customers we need to ensure that their environment is saved and easily accessible. We store this requirement for resourcing in our jobs metadata table, and when we need to execute we can easily access the requirements and use it to spin up a new Kubernetes Pod. The only thing that we need to watch out for are the large image downloads that can cause timeouts, which can be solved by increasing the startup_timeout_seconds variable for the Pods.
Figure 3 and figure 4 depict the memory available on the Airflow scheduler pod. Before we started using the KubernetesPodOperator, available memory for the scheduler was hitting 0 causing many ETL jobs to fail. After completing our migration, as shown in figure 4 the available memory continued to stay above 50% and jobs finished much faster as they were not competing for resources.
We got a lot of benefits from moving our Airflow setup to Kubernetes. Dockerizing helped us make our systems easier to run in this new setup. We now have more scalability and better resource utilization compared to using a single Kubernetes Node. The new system is helping us onboard 10 to 15 new ETL jobs per week without having to worry about how resources will be allocated for them. This setup on Kubernetes can be scaled by adding more nodes to ensure this growth is sustainable. With improved resource utilization ETL jobs are running faster and we are able to run more jobs concurrently, ensuring we meet our data availability SLA for our users. Additionally, new kinds of jobs are easier to integrate if they have different environment requirements. For example, we had a job that needed a specific version of Python and we were able to run it with a custom image and environment.
For companies that are starting their data journey, managing data pipelines with Airflow is a great choice but it needs to be set up in a scalable way using distributed execution. DoorDash started with a simple Airflow set up but this became a bottleneck with our growing business and data needs.We looked for options to mitigate and KubernetesPodOperator turned out to be an excellent solution/choice. Running our orchestration on Kubernetes helped us standardize deployment, monitoring, and access controls while supporting a more customizable workload reliably. For companies that are struggling to orchestrate data pipelines with a variety of environments and resource requirements, Airflow with Kubernetes Pod Operators is a great solution that provides scalability and dynamic setup to run a variety of workloads.
Thanks to Sophia Rong, Ian Schweer, Si Li, Satya Boora, and many other data infra engineers for their contributions and in making this a successful project. Additional thanks as well to Sudhir Tonse and the DoorDash Compute team for their support.
I’m curious why you chose to use the Puckel image, rather than extend Apache’s own image, is that because it’s more lightweight? I have found airflow image to be pretty hefty.