The ability to attach auxiliary metadata to requests within a large microservice architecture enables powerful use cases, such as infrastructure-level sharding, language localization, and testing-in-production. Adding this context to requests allows services and infrastructure libraries to make local decisions, and the context can also be used by infrastructure components along the directed acyclic graph that requests follow. Although enabling context in service-to-service requests brings large benefits, propagating this information across all of our microservices is a challenge.

To provide context to our requests, DoorDash is pioneering the adoption of the open source OpenTelemetry project to solve observability challenges for its diverse and expanding microservice architecture. OpenTelemetry relies on context propagation to stitch together the telemetry data for a particular request. Given the dearth of open source or third-party solutions for custom context propagation, OpenTelemetry comes closest to being an off-the-shelf offering, which is why we pragmatically chose it over other options.

In this article we will go over our experience adopting and tailoring OpenTelemetry to propagate custom context in order to power a number of critical use cases. We will take a deep dive into how custom context improves our services, how we implemented OpenTelemetry-based propagation, and how we rolled out new versions of OpenTelemetry and handled security concerns.

Diving deep into custom context use cases

DoorDash uses custom context to power a number of important use cases. The Remote Procedure Calls (RPCs) that microservices rely on to delegate work to other services use a standard transport protocol like HTTP or HTTP/2, and an encoding format like Protobuf, Thrift, or JSON to transmit requests and responses over the wire. Each service serves incoming requests using the data provided in the request. However, it is sometimes useful, and in some cases required, to include additional data with the incoming request. One example is an authentication token for the actors involved in a transaction. Authentication typically happens close to the network edge, and the resulting token can be passed along the service call graph as a protocol header instead of as a separate request field.
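To illustrate the pattern, the Kotlin sketch below shows a gRPC client interceptor that attaches an authentication token as a call header rather than as a request field. The header name, token source, and class name are hypothetical and not DoorDash's actual implementation.

import io.grpc.*

// Hypothetical interceptor: attaches an auth token as a protocol header so that
// outgoing calls carry it without changing any request schemas.
class AuthTokenInterceptor(private val tokenProvider: () -> String) : ClientInterceptor {
    private val authKey = Metadata.Key.of("authorization", Metadata.ASCII_STRING_MARSHALLER)

    override fun <ReqT, RespT> interceptCall(
        method: MethodDescriptor<ReqT, RespT>,
        callOptions: CallOptions,
        next: Channel
    ): ClientCall<ReqT, RespT> =
        object : ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
            next.newCall(method, callOptions)
        ) {
            override fun start(responseListener: ClientCall.Listener<RespT>, headers: Metadata) {
                headers.put(authKey, "Bearer ${tokenProvider()}")  // token minted near the edge
                super.start(responseListener, headers)
            }
        }
}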

Another use case is testing-in-production, which allows test traffic to flow through the production deployment. We attach a tenant-id context to every request to distinguish test traffic from production traffic, which lets us isolate data and ensure that test traffic does not mutate production data. The data isolation is abstracted in the infrastructure libraries, which use the context to route traffic to specific infrastructure components like databases and caches. With large-scale microservice deployments, the industry is converging on testing-in-production as a way to test reliably with lower operational overhead.
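As a sketch of how an infrastructure library might make such a local decision, the Kotlin snippet below reads a tenant-id entry from the propagated context, using the OpenTelemetry Baggage API discussed later in this article, and derives an isolated cache key for test traffic. The helper name and the "test-" prefix convention are illustrative only.

import io.opentelemetry.api.baggage.Baggage

// Hypothetical helper inside a caching library: test tenants get an isolated keyspace.
fun cacheKeyFor(entityKey: String): String {
    val tenantId = Baggage.current().getEntryValue("tenant-id")
    return if (tenantId != null && tenantId.startsWith("test-")) {
        "test:$tenantId:$entityKey"  // route test traffic to isolated data
    } else {
        entityKey                    // production traffic uses the regular keyspace
    }
}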

Many of the use cases that rely on context propagation are critical for running our normal business operations. This puts stringent reliability and correctness requirements on the context propagation infrastructure.

Context propagation with OpenTelemetry

For propagation, the context can be embedded directly into the request itself, for example by modifying the request’s Protocol Buffers. A more flexible approach, however, is to propagate the context as a protocol header. Using headers scales especially well when a diverse set of services is involved and when context needs to be propagated for most of the endpoints the services expose. Another advantage of headers is that propagation can be implicit: the caller does not need to explicitly add the context to outgoing calls, which makes introducing a new context a less invasive change.

OpenTelemetry requires propagation of trace headers, which include the tracing IDs and vendor-specific headers. OpenTelemetry provides auto-instrumentation to help propagate trace headers across thread and service boundaries. Auto-instrumentation covers an increasingly large variety of libraries and frameworks across different languages. This is especially true for Java/Kotlin, which is used by most of the DoorDash backend services.

Some notable features of OpenTelemetry’s context propagation are that it:

  • Is available through auto-instrumentation.
  • Supports libraries and frameworks in a variety of languages that we use at DoorDash, including Java/Kotlin, Node, Python, and Go.
  • Uses vendor-agnostic propagation formats, including open formats like the W3C's Trace Context and Baggage.
  • Supports synchronous flows like HTTP and HTTP/2, and asynchronous flows like Kafka.

OpenTelemetry supports multiple formats for propagation of context including Baggage, a format specifically designed for propagating custom context. 

OpenTelemetry propagation formats

OpenTelemetry supports a variety of propagation formats, like Trace Context, Baggage, Zipkin, and B3. At DoorDash we are standardizing on Trace Context for tracing data. For custom context propagation we are standardizing on Baggage.
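For services running the OpenTelemetry Java agent, this amounts to enabling the tracecontext and baggage propagators (for example via the agent's otel.propagators setting). The Kotlin sketch below shows the equivalent explicit SDK configuration for a service that is not auto-instrumented; it is illustrative rather than our exact setup.

import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.api.baggage.propagation.W3CBaggagePropagator
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator
import io.opentelemetry.context.propagation.ContextPropagators
import io.opentelemetry.context.propagation.TextMapPropagator
import io.opentelemetry.sdk.OpenTelemetrySdk

// Configure the SDK to propagate both W3C Trace Context (traceparent/tracestate)
// and W3C Baggage (baggage) headers on every instrumented call.
val openTelemetry: OpenTelemetry = OpenTelemetrySdk.builder()
    .setPropagators(
        ContextPropagators.create(
            TextMapPropagator.composite(
                W3CTraceContextPropagator.getInstance(),
                W3CBaggagePropagator.getInstance()
            )
        )
    )
    .build()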

A close look at OpenTelemetry’s propagation formats

Trace Context defines two headers: traceparent and tracestate.

A traceparent header, shown in Figure 1, helps uniquely identify an incoming request. It contains version, trace-id, parent-id, and trace-flags. This header helps stitch together the spans that a request generates as it flows from one component to another.

Figure 1: A traceparent header consists of opaque identifiers used for tracing.
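For example, a traceparent header might look like the following (the identifiers are illustrative): version 00, a 16-byte trace-id, an 8-byte parent-id, and trace-flags 01 indicating a sampled trace.

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01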

The tracestate header, shown in Figure 2, carries comma-delimited key-value pairs of arbitrary data, allowing additional identifiers to be propagated along with the traceparent header.

Figure 2: The Tracestate header is formatted as free text containing comma-delimited key-value pairs.
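An illustrative tracestate value, with vendor-specific keys and opaque values, might look like:

tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE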

Tracestate can be used to propagate custom context, but there are a few limitations. The standard recommends limiting the size of the header. Although this is not a hard requirement and the limit could be raised by making it configurable, any such change would have to be rolled out to every service.

Baggage, shown in Figure 3, is designed specifically to propagate custom context and has much higher limits on the size of the data being propagated. It defines a header called baggage, which is very similar to tracestate.

Figure 3: The Baggage header is formatted as free text containing comma-delimited key-value pairs.
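An illustrative baggage value carrying two of the custom context fields discussed later in this article (the values are hypothetical) might look like:

baggage: tenant-id=test-tenant-123,test-workspace=my-workspace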

As shown in Figure 4, custom context is defined as key-value pairs, similar to tracestate. Additionally, tags or properties can be attached to a key by appending them after the value, separated by semicolons.

Figure 4: Baggage headers can optionally contain additional properties for the key-value pairs.
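For example, a property (the property name here is hypothetical) can be appended to an entry with a semicolon:

baggage: tenant-id=test-tenant-123;source=client,test-workspace=my-workspace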

We abstract away the storage and retrieval of the custom context in helper libraries for all the common languages in use at DoorDash. Service owners can introduce a new custom context field by adding it to a central configuration, shown in Figure 5, which also serves as an allowlist. The configuration is a simple JSON file that lets service owners define certain properties of each context field.

{
 "test-workspace": {
   "max_length": 16,
   "allowed_from_client": true,
   "short_name": "tws"
 },
 "tenant-id": {
   "max_length": 16,
   "allowed_from_client": true,
   "short_name": "tid"
 },
 ...
}

Figure 5: This custom context allowlist shows two fields, test-workspace and tenant-id, each with three properties: the maximum length allowed for the field, a flag indicating whether the field can be propagated from the web/mobile clients, and a short name used for the actual propagation.

By introducing a custom context library, shown in Figure 6, we can change the underlying implementation of context propagation without affecting services. For example, this approach provides the flexibility to use a distributed cache like Redis for larger contexts and propagate only the cache reference in the OpenTelemetry headers.

Figure 6: The custom context library, used by services to access context, abstracts the underlying implementation of the context. It uses OpenTelemetry headers and an optional distributed cache, like Redis, for larger contexts.
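A minimal Kotlin sketch of such a helper library is shown below, assuming the allowlist from Figure 5 has been loaded into memory. The class and function names are illustrative rather than our actual API, and the Redis-backed path for larger contexts is omitted.

import io.opentelemetry.api.baggage.Baggage
import io.opentelemetry.context.Context

// One entry of the allowlist in Figure 5.
data class ContextField(val maxLength: Int, val allowedFromClient: Boolean, val shortName: String)

class CustomContext(private val allowlist: Map<String, ContextField>) {

    // Reads a custom context value for the current request, or null if it is
    // absent or the field is not allowlisted.
    fun get(name: String): String? {
        val field = allowlist[name] ?: return null
        return Baggage.current().getEntryValue(field.shortName)
    }

    // Returns a Context carrying the new value; making it current lets
    // auto-instrumentation propagate the value on outgoing calls.
    fun withValue(name: String, value: String): Context {
        val field = requireNotNull(allowlist[name]) { "$name is not an allowlisted context field" }
        require(value.length <= field.maxLength) { "$name exceeds max_length ${field.maxLength}" }
        val baggage = Baggage.current().toBuilder()
            .put(field.shortName, value)
            .build()
        return Context.current().with(baggage)
    }
}

A service would typically make the returned Context current (for example with Context.makeCurrent()) for the scope of work that should carry the value.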

Eventually, we envision OpenTelemetry-based propagation starting right from our mobile and web clients. For now, we use raw protocol headers to propagate context from those clients. Figure 7 details the flow of headers as a request travels from the web/mobile clients to the backend services. We use automatic instrumentation to onboard the supported services to OpenTelemetry. OpenTelemetry-based propagation begins at the backend-for-frontend (BFF) services, where the incoming raw protocol headers are transformed into OpenTelemetry headers, which are then propagated to the backend services by OpenTelemetry auto-instrumentation.

Figure 7: Context is propagated using raw protocol headers from mobile/web clients, which are then transformed into OpenTelemetry headers in the BFF services. Backend services use OpenTelemetry headers exclusively for propagation.
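A simplified Kotlin sketch of that BFF-side transformation is shown below. The raw client header names are hypothetical, and in practice this logic would run in a request filter or interceptor.

import io.opentelemetry.api.baggage.Baggage
import io.opentelemetry.context.Context

// Copy raw client headers (hypothetical names) into OpenTelemetry Baggage so that
// auto-instrumentation propagates them to downstream backend services.
fun toOpenTelemetryContext(rawHeaders: Map<String, String>): Context {
    val baggage = Baggage.current().toBuilder()
    rawHeaders["x-tenant-id"]?.let { baggage.put("tenant-id", it) }
    rawHeaders["x-test-workspace"]?.let { baggage.put("test-workspace", it) }
    return Context.current().with(baggage.build())
}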

It is important to note that the sampling policy for the OpenTelemetry traces does not affect the propagation of context. The sampling policies only affect collection and aggregation of the traces.

Rolling out new versions of OpenTelemetry

As one of the early adopters of OpenTelemetry, we have had to keep up with the rapid churn of the open source tooling and its frequent releases, including incompatible API changes. We quickly realized that we would potentially have multiple versions of the OpenTelemetry tooling deployed in production. Fortunately, the open propagation format preserves header formats across versions. However, we do have to track library versions that depend on specific OpenTelemetry versions: bumping the OpenTelemetry version sometimes requires bumping the versions of related libraries across services en masse. We have been exploring tools, including some homegrown ones, to facilitate automatic updates of library versions.

Given the rapid development within the project, we handle the rollout of a new OpenTelemetry version with caution. To contain any fallout, we devised a way to selectively roll out a new version to a portion of the fleet and gradually ramp up as we build confidence. That said, because critical use cases rely on context propagation, it is imperative that context is propagated regardless of the OpenTelemetry version a service is using.

Addressing security considerations

With OpenTelemetry auto-instrumentation, the headers are propagated implicitly and unconditionally. While this simplifies adoption, it poses the risk of exposing potentially sensitive context to third-party entities that our services call. Although auto-instrumentation can be disabled for a library's propagation, it cannot be disabled selectively based on the network targets. The risk also applies in the other direction: third-party entities calling into DoorDash might inject irrelevant context that we would prefer not to propagate to DoorDash services. To address this, we drop all OpenTelemetry headers other than traceparent at both the ingress and the egress of the DoorDash network. This prevents unwarranted injection of context from outside the network as well as exposure of internal context to the outside.
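In practice this rule would be enforced at the edge proxies or gateways rather than in application code; the Kotlin sketch below only illustrates the intent, with a non-exhaustive list of headers to drop.

// Keep traceparent; drop other propagation headers at ingress and egress.
val droppedPropagationHeaders = setOf("tracestate", "baggage", "b3")

fun sanitizeEdgeHeaders(headers: Map<String, String>): Map<String, String> =
    headers.filterKeys { it.lowercase() !in droppedPropagationHeaders }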

The library abstraction for custom context also allows us to optionally encrypt just the context headers when service-to-service traffic is not encrypted, providing an additional layer of security against exposure of potentially sensitive data.

Conclusion

Propagating cross-cutting, frequently required business context is pervasive in a rapidly growing microservice architecture. OpenTelemetry offers a solution that not only enables distributed tracing in a vendor-agnostic manner but also provides easy-to-use open source tooling for a variety of languages and platforms. With certain security and rollout guardrails in place, custom context propagation via OpenTelemetry can help accelerate the use cases that come to rely on it.