We use a variety of data stores at DoorDash to power our business, but one of our primary tools is classic relational data powered by Postgres. As our business grows and our product offerings broaden, our data models evolve, requiring schema changes and backfills to existing databases.
When DoorDash was smaller and in fewer time zones, it was reasonable to take a few minutes of downtime at night to perform these kinds of data operations. But as we have grown to include merchants, customers, and Dashers in over 4,000 cities on two continents, it is no longer acceptable to have downtime in our system. Our global span means we have to engineer solutions to perform major operations on huge tables without disrupting the business.
During our pre-growth phase, the most obvious way of backfilling a new column was to simply add the column as nullable, and then start a background process to fill in the rows in batches. But some of our tables have become so large and include so many indexes that this process is far too slow to stick to any kind of reasonable product schedule.
Recently, our Storage team honed a backfilling technique for our Postgres databases that allows us to completely rebuild a table—changing types and constraints for multiple columns all at once—without affecting our production systems. The unexpected benefit of this technique is that we can transactionally swap the two tables both forwards and backwards, enabling us to safely test out the new table and switch back if problems arise while maintaining data integrity.
This method reduced a projected three month project down to less than a week while letting us update tables in a production environment. Not only could we add our new field, but we also had the opportunity to clean up the entire data model, fixing data inconsistencies and adding constraints.
The need for data backfills
Not every schema change requires a backfill. Often, it is easier to add a nullable column and implement default behavior for NULL values in application code. While fast, this approach has disadvantages, such as forgoing database-level constraints: if application code forgets to set the value, it silently gets the default behavior, which may not be what was intended.
Some schema changes require a backfill, though. For example, switching data types for primary keys requires that all historical data be updated. Similarly, denormalization for performance reasons requires backfilling historical data if no sensible default behavior can be implemented.
The difficulties of in-place data backfills of large tables
Trying to update every row in a large production table presents several problems.
One problem is speed. Updating a column on a billion rows in Postgres is equivalent to deleting a billion rows and inserting a billion rows, thanks to the way Multiversion Concurrency Control (MVCC) works under the covers. The old rows will have to be collected by the VACUUM process. All of this puts a lot of pressure on the data infrastructure, using compute cycles and potentially straining resources, leading to slow-downs in production systems.
The number of indexes on the table amplifies this pressure. Each index on a table effectively requires another insert/delete pair. The database will also have to go find the tuple in the heap, which requires reading that portion of the index into the cache. At DoorDash, our production data accesses tend to be concentrated at the very tail end of our data, so the serialized reads of a backfill put pressure on the database caches.
The second problem is that if the writes happen too fast, our read replicas can fall behind the primary writer. This problem of replica lag happens at DoorDash because we make heavy use of AWS Aurora read replicas to service read-only database traffic for our production systems. In standard Postgres, the read replicas stay up to date with the primary by reading the write-ahead logging (WAL), which is a stream of updated pages that flows from the primary writer to the read replicas. Aurora Postgres uses a different mechanism to keep the replicas updated, but it also suffers from an analogous problem of replication lag. Aurora replicas typically stay less than 100 milliseconds behind, which is sufficient for our application logic. But without careful monitoring, we found that it is fairly easy to push the replica lag above 10 seconds, which, unsurprisingly, causes production issues.
The third major problem is that even “safe” schema changes, like widening an INT column to BIGINT, may uncover unexpected bugs in production code that are not trivial to locate by mere inspection. It can be nerve-wracking to simply alter an in-use schema without a backup plan.
The solution to all of these issues is to avoid altering the production table in place entirely. Instead, we copy the data to a lightly indexed shadow table, rebuild the indexes afterwards, and then swap the tables.
Creating a shadow table
The first step is to create a shadow table with a schema identical to that of the source table:
CREATE TABLE shadow_table (LIKE source_table);
The new shadow table has the same schema as the source, but with none of the indexes. We do need one index on the primary key so that we can do quick lookups during the backfill process:
ALTER TABLE shadow_table ADD PRIMARY KEY (id);
The final step is to make schema alterations on the new table. Because we are rewriting the entire table, this is a good opportunity to discharge any technical debt accumulated over time. Columns that were previously added to the table as nullable for convenience can now be backfilled with real data, making it possible to add a NOT NULL constraint. We can also widen types, such as taking INT columns to BIGINT:

ALTER TABLE shadow_table ALTER COLUMN id TYPE BIGINT;
ALTER TABLE shadow_table ALTER COLUMN uuid SET NOT NULL;
ALTER TABLE shadow_table ALTER COLUMN late_added_column SET NOT NULL;
Writing the copy function
Next we will create a Postgres function that will copy and backfill rows at the same time. We will use this function in both the trigger, which will keep new and updated rows synchronized with the shadow table, and the backfill script, which will copy the historical data.
The function is essentially an INSERT paired with a SELECT statement, using COALESCE statements to backfill null columns. In this example, we weren’t adding any columns, so we rely on the fact that the two tables have columns in the same order, but if this operation had added columns, we could deal with those here by listing the columns explicitly in the INSERT statement:

CREATE OR REPLACE FUNCTION copy_from_source_to_shadow(INTEGER, INTEGER)
RETURNS VOID AS $$
INSERT INTO shadow_table
SELECT
    id,
    COALESCE(uuid, uuid_generate_v4()),
    created_at,
    COALESCE(late_added_column, true),
    ...
FROM source_table
WHERE id BETWEEN $1 AND $2
ON CONFLICT DO NOTHING
$$ LANGUAGE SQL SECURITY DEFINER;
The COALESCE statements are the essential parts: the effect here is “look to see if a value is NULL and, if so, replace it with this other thing.” The use of COALESCE() allowed us to do some data repair on over a dozen columns all at the same time.
The INT to BIGINT conversion is free with this technique: just alter the shadow table’s schema before starting the procedure, and the INSERT handles the type promotion.
Finally, we want to make sure that we do no harm, so this function is written to minimize the risk of the backfill script accidentally overwriting newer data with stale data. The key safety feature here is ON CONFLICT DO NOTHING, which means it is safe to run this function multiple times over the same range. We will see how to deal with updates in the trigger below.
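To make that idempotency concrete, here is a minimal sketch of the same INSERT … SELECT … ON CONFLICT DO NOTHING pattern, using Python’s built-in sqlite3 (version 3.24+) standing in for Postgres. The tables and the 'generated-' placeholder for uuid_generate_v4() are illustrative, not our production schema:

```python
import sqlite3

# Minimal stand-ins for source_table / shadow_table; 'generated-' || id
# plays the role of uuid_generate_v4() here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER PRIMARY KEY, uuid TEXT);
    CREATE TABLE shadow_table (id INTEGER PRIMARY KEY, uuid TEXT NOT NULL);
    INSERT INTO source_table VALUES (1, 'a'), (2, NULL), (3, 'c');
""")

def copy_range(lo, hi):
    # Mirrors the backfill function: COALESCE repairs NULLs on the way in,
    # and ON CONFLICT DO NOTHING skips ids already present in the shadow.
    conn.execute("""
        INSERT INTO shadow_table
        SELECT id, COALESCE(uuid, 'generated-' || id)
        FROM source_table
        WHERE id BETWEEN ? AND ?
        ON CONFLICT DO NOTHING
    """, (lo, hi))

copy_range(1, 3)
copy_range(1, 3)  # safe to re-run the same range
count = conn.execute("SELECT COUNT(*) FROM shadow_table").fetchone()[0]
```

Running the same range twice leaves the shadow row count unchanged, which is exactly why re-running the backfill over an already-copied range is harmless.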
Setting the trigger
Even application developers well versed in the intricacies of SQL may not have had an opportunity to use a database trigger, as this feature tends not to be integrated into application-side frameworks. A trigger is a powerful feature that allows us to attach arbitrary SQL to various actions in a transactionally safe way. In our case, we will attach our copy function to every type of data modification statement (INSERT, UPDATE, and DELETE) so that we can ensure that all changes to the production database will be reflected in the shadow copy.
The actual trigger is straightforward, except that for UPDATE it performs a DELETE/INSERT pair inside a transaction. Manually deleting and re-inserting in this way allows us to reuse the main backfill function (which otherwise would do nothing because of the ON CONFLICT DO NOTHING). It also ensures that we won’t make a mistake and overwrite newer data, because the backfill function can’t perform an UPDATE.

CREATE OR REPLACE FUNCTION shadow_trigger() RETURNS TRIGGER AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        PERFORM copy_from_source_to_shadow(NEW.id, NEW.id);
        RETURN NEW;
    ELSIF (TG_OP = 'UPDATE') THEN
        DELETE FROM shadow_table WHERE id = OLD.id;
        PERFORM copy_from_source_to_shadow(NEW.id, NEW.id);
        RETURN NEW;
    ELSIF (TG_OP = 'DELETE') THEN
        DELETE FROM shadow_table WHERE id = OLD.id;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE PLPGSQL SECURITY DEFINER;

CREATE TRIGGER shadow_trigger
AFTER INSERT OR UPDATE OR DELETE ON source_table
FOR EACH ROW EXECUTE PROCEDURE shadow_trigger();
Performing the backfill
For the actual backfill, we used a custom Python script that uses a direct database connection in a production shell. The advantage here is that development is interactive, we can test on a staging environment, and we can stop it instantly if something goes wrong. The downside is that only the engineer who has access to that production shell can stop it, so it must be run while someone is able to monitor it and stop if something goes awry.
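The script itself is not reproduced in this post, but its core loop is simple. Below is a hedged sketch of how such a driver might chunk the keyspace. The executor is injected (in production it would be, say, a database cursor running the copy function over a live connection), and all names and the batch size are illustrative:

```python
def batch_ranges(max_id, batch_size=10_000, start_id=1):
    """Yield inclusive (lo, hi) id ranges covering [start_id, max_id]."""
    lo = start_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1

def run_backfill(execute_sql, max_id, batch_size=10_000):
    # In production, execute_sql would run each statement over a live
    # connection; injecting it keeps the loop easy to test and to stop.
    for lo, hi in batch_ranges(max_id, batch_size):
        execute_sql("SELECT copy_from_source_to_shadow(%s, %s)", (lo, hi))
```

Because ranges are inclusive and resume cleanly, the loop can be stopped and restarted at any batch boundary without losing or duplicating work.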
In our first round of backfilling, the speed was orders of magnitude faster than in our earlier attempts at in-place modification of the original production table. We achieved about 10,000 rows per second.
In fact, the real problem was that we were writing a bit too fast for our replicas to keep up under production load. Our Postgres replicas generally keep replication lag below 20 milliseconds, even under high load.
With a microservices architecture, it is common for a record to be inserted or updated and then immediately read by another service. Most of our code is resilient to slight replication lag, but if the lag gets too high, our system can start failing.
This is exactly what happened to us right at the tail end of the backfill—the replication lag spiked to 10 seconds. Our suspicion is that because Aurora Postgres only streams pages that are cached in the replicas, we only had issues when we started to touch more recent data residing in hot pages.
Regardless of the cause, it turns out that Aurora Postgres exposes instantaneous replication lag using the following query:
SELECT max(replica_lag_in_msec) as replica_lag FROM aurora_replica_status();
We now use this check in our backfill scripts between INSERT statements. If the lag gets too high, we simply sleep until it drops below acceptable levels. By checking the lag, we can keep the backfill going throughout the day, even under high load, with confidence that this issue will not crop up.
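A minimal sketch of that back-off check might look like the following. Here get_lag_ms stands in for running the aurora_replica_status() query shown above, and the threshold is a hypothetical value, not our production setting:

```python
import time

def wait_for_replicas(get_lag_ms, threshold_ms=250, sleep=time.sleep):
    """Pause between INSERT batches until replica lag is acceptable.

    get_lag_ms should return the current lag in milliseconds, e.g. by
    running the aurora_replica_status() query over a live connection.
    Returns how many times the loop slept, for logging purposes.
    """
    waits = 0
    while get_lag_ms() > threshold_ms:
        sleep(1)
        waits += 1
    return waits
```

Calling this between batches turns the backfill into a self-throttling process: it runs at full speed while replicas are healthy and backs off automatically when they fall behind.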
Making the swap
Postgres can perform schema changes in a transaction, including renaming tables and creating and dropping triggers. This is an extremely powerful tool for making changes in a running production system, as we can swap two tables in a transactional way. This means that no incoming transaction will ever see the table in an inconsistent state—queries just start flowing from the old table to the new table instantly.
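As an illustration of the atomic rename (not our production migration script, which also moves the triggers), here is the swap sketched with Python’s built-in sqlite3, which likewise supports transactional DDL; the table names are hypothetical:

```python
import sqlite3

# sqlite3 stands in for Postgres here; in production this would be the
# same three statements against the real source_table and shadow_table.
conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage the transaction explicitly
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE shadow_table (id INTEGER PRIMARY KEY, note TEXT)")

# The swap: queries that commit before this transaction see the old
# table; queries after it see the new one. There is no in-between state.
conn.execute("BEGIN")
conn.execute("ALTER TABLE source_table RENAME TO retired_table")
conn.execute("ALTER TABLE shadow_table RENAME TO source_table")
conn.execute("COMMIT")

tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```

After the commit, the production name points at the rebuilt table, and the old table is retained under a new name so it can be swapped back if needed.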
Even better, the copy function and trigger can be adjusted to flow in the reverse direction. The COALESCE statements need to be dropped, of course, and any differences in the columns need to be accounted for, but structurally the reverse trigger is the same idea.
In fact, when we first swapped the tables during this particular operation, we uncovered a bug in some legacy Python code that was expressly checking the type of a column. By having the reverse trigger in place and having a reverse swap handy, we instantly swapped back to the old table without data loss to give us time to prepare our code for the schema changes. The double swap procedure kept both tables in sync in both directions and caused no disruption to our production system.
This ability to flip flop between two tables while keeping them in sync is the superpower of this technique.
All database schemas evolve over time, but at DoorDash we have an ever-evolving set of product demands, and we have to meet those demands by being fluid and dynamic with our databases. Downtime and maintenance windows are not acceptable, so this technique not only lets us make schema changes safely and confidently, but also lets us make them much faster than traditional in-place backfills.
Although this particular solution is tailored for Postgres and uses some features specific to AWS Aurora, in general the technique should work on almost any relational database. Even on databases without transactional DDL, it still minimizes the period of disruption to the time it takes to perform the swap.
In the future, we may consider using this technique for other types of schema changes that don’t even involve a backfill, such as dropping lightly-used indexes. Because recreating an index can take over an hour, there is considerable risk involved in dropping any index. But by having two versions of the same table in sync at the same time, we can safely test out these kinds of changes with minimal risk to our production system.
Many people helped with this project. Big thanks to Sean Chittenden, Robert Treat, Payal Singh, Alessandro Salvatori, Kosha Shah, and Akshat Nair.
How does this (seemingly more manual) migration approach differ from other off the shelf OSC tooling like pt-online-schema-change, gh-ost, etc, etc?
One reason is simply that we primarily run on Amazon Aurora Postgres, so many extension-based tools are not available, and even if they were, we would not be comfortable deploying them on our production clusters without extensive testing. A second reason is control. Because of how much demand we place on our databases, we need to be able to control the amount of load and its effect on latencies, and we need to be sure we can stop any process once started if we see an effect on our production systems. For example, here we implemented custom back-off logic that monitored Aurora replication lag and slowed the write volume if it spiked. A final reason is validation and trust. Some of these tools repeatedly warn the user to back up their data before use, which does not inspire confidence on a mission-critical database. Operations that use standard SQL give us great confidence that our solution is correct.
Wondering if this:

CREATE TRIGGER shadow_trigger
AFTER INSERT OR UPDATE OR DELETE ON shadow_table
FOR EACH ROW EXECUTE PROCEDURE shadow_trigger();

should be:

CREATE TRIGGER shadow_trigger
AFTER INSERT OR UPDATE OR DELETE ON source_table
FOR EACH ROW EXECUTE PROCEDURE shadow_trigger();
thanks for the keen eye — fixed in the body now