Monitoring is hugely important, especially for a site like DoorDash that operates 24/7. In modern-day DevOps, monitoring allows us to be responsive to issues and helps us maintain reliable services. At DoorDash, we use StatsD heavily for custom metrics monitoring. In this blog post, we take a deeper look at how DoorDash uses StatsD within its infrastructure.
What is StatsD?
According to The News Stack:
“StatsD is a standard and, by extension, a set of tools that can be used to send, collect, and aggregate custom metrics from any application. Originally, StatsD referred to a daemon written by Etsy in Node.js. Today, the term StatsD refers to both the protocol used in the original daemon, as well as a collection of software and services that implement this protocol.”
According to Etsy’s blog:
StatsD is built to “Measure Anything, Measure Everything”. Instead of TCP, StatsD uses UDP, which provides desirable speed with little overhead as possible.
StatsD at DoorDash
At DoorDash, we use StatsD configured with a Wavefront backend to build a responsive monitoring and alerting system. We feed more than 100k pps (points per second) to Wavefront via StatsD, both from the infrastructure side and the application side. We use StatsD and Wavefront to monitor the traffic throughput of different API domains, RabbitMQ queue lengths, uWSGI and Envoy stats, restaurant delivery volumes and statuses, market health, etc. Wavefront’s powerful query language allows us to easily visualize and debug our time series data. The alerting system is flexible enough to notify us through email, Slack or PagerDuty based on severity. Well-tuned alerts helps us lower MTTD (Mean Time To Detect). We encourage our engineers to customize their own metrics to fully monitor their system’s health and performance.
As a startup company, scalable issues come with high growth rate. Last year, the scalability of our metrics infrastructure was challenged. Our Growth engineering team reported that the volume of our end-user activities was much lower than expected. After cross-referencing our tracking data in other sources, we confirmed that the problem lay within our monitoring system. Solving these scaling issues along the way led us to a more scalable infrastructure.
Monitoring infrastructure evolution
- One StatsD on one AWS EC2
At the beginning, we had one StatsD process running on an eight core AWS EC2 instance.The setup was quick and simple; however, when incidents happened, the load average alert on this EC2 instance never fires even if the StatsD process is overloaded. We didn’t notice the issue until Growth engineering team got paged for false alarms. Even though we have eight cores on the StatsD EC2, the Etsy version of StatsD we are running is single threaded. Thus the overall instance average of the CPU utilization was not enough to trigger the alert. We also spent some time to see how the Linux kernel handles UDP requests and gained some visibility into the server’s capacity. By looking at
/proc/net/udp, we can find the current queue size for each socket and whether it was dropping packets.
In the example above, there are lots of dropped packets and a high
1FBD (8125) at
local_address 00000000:1FBD(0.0.0.0:8125)which is listened by StatsD.
Meaning for some of the columns:
local_address: Hexadecimal local address of the socket and port number.
rx_queue: queue length for incoming UDP datagrams.
drops:The number of datagram drops associated with this socket. A non-zero number can indicate the StatsD was overloaded.
- NGINX as StatsD proxy to distribute traffic
After understanding the problem better, we started to look for solutions. There are already plenty of blogs about how people scale their StatsD, but we also tried to explore other possible ways. For example, we tried to use NGINX as a UDP proxy. We did encounter some issues with the number of UDP connections. If we want to use NGINX, we also need to figure out a way to make consistent hashing for UDP requests so that metrics will always hit the same StatsD, otherwise counters won’t be able to accumulated correctly and gauge will be showing multiple StatsD. Also potentially unbalanced hashing will cause a certain StatsD node to be overloaded. So, we decided to pass the NGINX solution at the moment.
- Local StatsD
A quick patch we did to mitigate the packet loss issue (due to maxing out a single CPU core) was to setup something we called local StatsD. Basically we installed StatsD on the EC2 instance itself and, in rare cases, inside each application containers. It was an OK short term solution, but it increased our Wavefront cost since metrics were not as well batched as before. Also the higher cardinality made Wavefront time series queries slower.
- StatsD proxy and StatsD on one EC2 host
To reduce the Wavefront cost and increase the performance, we needed to aggregate our metrics before sending to Wavefront. Each StatsD proxy is assigned to different ports on the EC2. Looking at
/proc/net/udp we know that it is useful if we can allocate more memory to the recv buffer so that we can hold more data during a traffic surge, assuming the StatsD process can consume the messages fast enough after the surge. There was some tuning we did with Linux kernel. We added the following configuration into
net.core.rmem_max = 2147483647 net.core.rmem_default = 167772160 net.core.wmem_default = 8388608 net.core.wmem_max = 8388608
- Sharding by application
Our Dasher dispatch system related applications pump a lot of matrices into Wavefront, which crashed our StatsD Proxy and StatsD EC2 often. So we started to hold three StatsD EC2 instances: dispatch StatsD for the dispatch system, monolith StatsD for our monolith application and global StatsD for all other microservices to share.
- StatsD proxy + StatsD on multiple EC2
The previous solution worked for a while. But with continued growth, the sharded architecture could no longer enough to handle the traffic. And only a limited number of StatsD proxy processes and StatsD processes can run on a single host. We had a lot of dropped packets from the global StatsD server. Instead of putting multiple StatsD proxies and multiple StatsD’s on the same host, we built a tier of StatsD proxies fronting another tier of StatsD processes. In this horizontally scalable architecture, we can add more StatsD proxies and StatsD when there is any dropped data.
- Use AWS A record instead of AWS Weighted Round Robin (WRR) CNAME for StatsD proxy DNS records. One behavior we noticed with WRR CNAME setup was that when using the StatsD, multiple servers would resolve the StatsD proxy DNS as the same IP address during the same period of time. This is likely due to how CNAME round robin works in AWS. This will end up with the same StatsD proxy server causing the overload of a specific StatsD proxy. To evenly distribute the workload of the StatsD proxies, we decided to use an old school DNS round robin of multiple A records.
- Turn on deleteIdleStats for StatsD configuration to handle missing metrics in Wavefront. DeleteIdleStats is an option that doesn’t send values to graphite for inactive counters, sets, gauges, or timers as opposed to sending 0. For gauges, this unsets the gauge (instead of sending the previous value). By turning this on, we can clearly identify “no data” metrics. And according to Wavefront, wrapping the alert condition with a default() function is an effective way of dealing with missing data.
- Use Elastic IP for StatsD proxy servers, since the client would resolve the StatsD proxy DNS as IP using the StatsD library. Once the client connects to a StatsD proxy and doesn’t try to reconnect to the StatsD proxy, the client would cache the StatsD proxy IP until the process is recycled. If the StatsD proxy servers are relaunched during this period of time, the server will not connect to the right StatsD proxy anymore unless the IP of the StatsD proxy server stayed the same as before. So by using Elastic IP, we can reduce the misconnection between StatsD client and StatsD proxy servers and lower the data loss possibility. And from client side, by configuring max request limit for a WSGI server, it should be able to re-resolve the DNS within an expected time window, which is similar as a TTL and helps to reduce misconnection.
- Continuous monitoring for StatsD EC2 and StatsD Proxy is imperative to avoid server crashes and mitigate potential disaster, especially when you are running services in the cloud. Some alerts based on the StatsD metrics are really sensitive, so we don’t want engineers to be paged because of missing data. Some of our StatsD metrics are used for our cloud resources’ autoscaling, so missing data will be a disaster.
- CPU and memory analysis of the StatsD EC2 and StatsD proxy EC2 can help to choose the right size for the EC2 instances, which reduces unnecessary cloud resource cost.
Special thanks to Stephen Chu, Jonathan Shih and Zhaobang Liu for their help in publishing this post.
See something we did wrong or could do better, please let us know! And if you find these problems interesting, come work with us!