Enforce Timeout: A DoorDash Reliability Methodology

“What would happen if we removed statement timeouts in our PostgreSQL databases?” That was one of the questions asked in a management meeting. At the time I only responded that it would be bad: it would cause problems and make them harder to debug. However, I realize now that this is a topic many people don’t dig deeply into, so I’ve decided to share some of the best practices we follow at DoorDash.

So, what would happen? In the best case, we might notice nothing for a few weeks. However, in a fast-growing company like DoorDash, with hundreds of engineers contributing code every day to a complicated system, we would eventually have an outage caused by unbounded resource usage, the kind of problem that a proper timeout setting would easily surface, if not prevent.

Why You Need Timeout

We live in the real world, where resources are always limited. For actions like creating a connection, making database queries, or executing functions, a timeout won’t improve the efficiency of any of them. However, a timeout creates an upper bound and stops outliers from causing outsized damage.
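
To make the upper bound concrete, here is a small arithmetic sketch in Python. The request durations and the 2-second limit are made-up numbers for illustration, not DoorDash measurements, and `worker_seconds` is a hypothetical helper.

```python
def worker_seconds(durations, timeout=None):
    """Total time workers stay busy; each request is cut off at `timeout` if set."""
    if timeout is None:
        return sum(durations)
    return sum(min(d, timeout) for d in durations)


# Hypothetical workload: 98 fast requests and 2 runaway outliers (in seconds).
durations = [0.05] * 98 + [600.0, 900.0]

unbounded = worker_seconds(durations)
bounded = worker_seconds(durations, timeout=2.0)
print(f"without timeout: {unbounded:.1f}s, with 2s timeout: {bounded:.1f}s")
```

The fast requests cost the same either way; the timeout only changes how much the two outliers are allowed to consume.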

Normally a timeout has a direct impact on site latency, while site health is indicated by several signals: throughput, latency, error rate, and saturation. When a timeout fires and the exception or return code is captured correctly, you will see the signal in the error rate. If you don’t have enough capacity, you may see the signal in saturation, e.g., web requests will queue up and throughput will drop. Similarly, if you don’t have timeout limits, the error rate won’t increase while resource consumption continues to grow. It is then just a matter of time before the system becomes saturated and throughput drops.

With good timeout settings, damage is kept under control and made more visible. Without timeout settings, your system may fail in unexpected ways and people will struggle to figure out why.

How to Set Timeout

Everything should have a timeout, but what should the value be? It is complicated. For example, PostgreSQL has various timeout-related settings: statement_timeout, lock_timeout, keepalive-related configurations, idle_in_transaction_session_timeout, etc. Different domains like queuing systems and web servers have their own sets of timeouts. A good way to think about it is architecturally, considering the upstreams and downstreams. Tower of Hanoi Timeouts presents a great rule of thumb for structuring nested timeouts. tl;dr:

No child request should be able to exceed the timeout of the parent.

Whether the timeout applies to connection creation, lock acquisition, or query/transaction execution, make sure no child request can exceed the timeout of its parent. A timeout normally means some process is forced to exit. If the parent times out first, it hides the issue that could have been caught at the child level and makes it harder to find the root cause.
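
One way to enforce this rule in application code is to hand each child call a timeout derived from whatever is left of the parent's budget. The sketch below is a minimal illustration, not DoorDash's implementation: the `Deadline` class, its method names, and the 10s/30s numbers are all assumptions made for the example.

```python
import time


class Deadline:
    """Tracks a parent's time budget so a child is never given more than remains."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the parent's budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def child_timeout(self, desired_seconds):
        """Timeout for a child call: the desired value, capped by the remaining budget."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("parent deadline already exceeded")
        return min(desired_seconds, remaining)


# Hypothetical numbers: a 10s request budget and a child that would like 30s.
request_deadline = Deadline(budget_seconds=10.0)
query_timeout = request_deadline.child_timeout(desired_seconds=30.0)
print(f"query timeout capped at {query_timeout:.1f}s")
```

Because the child's timeout is always the smaller of what it asks for and what the parent has left, the child fails first and reports its own error instead of being killed silently by the parent.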

Specifically for Databases

Timeouts are often overlooked in the world of databases. A simple explanation is that many developers are very optimistic and try to live in an ideal world of infinite resources. Among all resources, CPU and I/O are normally the most affected by slow queries, which can be gated by statement timeout settings in PostgreSQL. For databases, saturated CPU or I/O is a disaster. We tested some of our most resource-intensive queries on our staging databases, and AWS Performance Insights suggested we would need an instance type 1000x bigger to support them, which of course is not available in AWS (yet).

In AWS Aurora, timeouts can be even more important than in traditional RDS. Unexpected I/O may add more pressure than the cluster can handle, forcing replica nodes to restart in order to recover to a healthy state. Aurora has an interesting architecture: all nodes, master or replica, share the same underlying storage layer. You will still see millisecond-level replica lag, which isn’t really related to data replication; it takes the replicas some time to update their in-memory state for things like B-tree indexes. When an Aurora replica detects that it is lagging too much, it triggers a restart to get consistent again quickly.

Statement Timeout Tuning at DoorDash

Inspired by a talk with our friend Marco Almeida, who lowered the PostgreSQL statement timeout to 1s at Thumbtack, we started tuning our statement timeout setting one year ago. The original setting was 30s, which was even longer than our uWSGI harakiri setting. Currently, our main databases’ default statement timeout is 2s, and lowering it further is still a work in progress.
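
As an illustration of how a default like this can be applied per session, the sketch below builds a libpq `options` string that sets `statement_timeout` and `lock_timeout` when a connection is opened. The helper name `pg_options` and the 500 ms `lock_timeout` are assumptions for the example; only the 2s statement timeout comes from the text above. Values without units are interpreted by PostgreSQL as milliseconds.

```python
def pg_options(statement_timeout_ms, lock_timeout_ms):
    """Build a libpq `options` string that sets per-session timeouts."""
    # PostgreSQL's documentation notes that a lock_timeout equal to or larger
    # than a nonzero statement_timeout is pointless: the statement timeout
    # would always trigger first.
    if lock_timeout_ms >= statement_timeout_ms:
        raise ValueError("lock_timeout should be shorter than statement_timeout")
    return f"-c statement_timeout={statement_timeout_ms} -c lock_timeout={lock_timeout_ms}"


# 2000 ms matches the 2s default; the 500 ms lock_timeout is a hypothetical value.
options = pg_options(statement_timeout_ms=2000, lock_timeout_ms=500)
print(options)
```

With a libpq-based driver such as psycopg2, a string like this can be passed as the `options` connection parameter. Setting the timeout per connection keeps the value visible in application configuration, at the cost of relying on every client to set it; a server-side or role-level default is the stricter alternative.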

We use several tools to gain visibility and insight into how our statement timeout tuning affects our system. pganalyze helps us gather information from pg_stat_activity so that we know how long each query takes. Exceptions caused by transactions hitting the statement timeout are visible to us in Sentry and New Relic. The error rate is reviewed weekly, and we have organizational support to fix the errors because it is so important: it improves stability and performance and saves money. In this way we create a feedback loop that makes our system more and more stable.

Proper timeout settings contribute to production stability and debuggability. They are important. If you care about reliability, spend time improving these configurations, and do not think about removing them.
