When DoorDash approached the limits of what our Django-based monolithic codebase could support, we needed to design a new stack that would provide a strong foundation for our logistics services. This new platform would need to support our future growth and enable our team to build using better patterns going forward. 

Under our legacy system, the number of nodes that needed to be updated added significant time to releases. Bisecting bad deploys (finding out which commit or commits were causing issues) became harder and took longer because of the number of commits in each deploy. On top of that, our monolith was built on old versions of Python 2 and Django, which were rapidly entering end-of-life for security support.

We needed to break parts off of our monolith, allowing our systems to scale better, and decide how we wanted our new services to look and behave. Finding a tech stack that would support this effort was the first step in the process. After surveying a number of different languages, we chose Kotlin for its rich ecosystem, interoperability with Java, and developer friendliness. However, we needed to make some changes to handle its growing pains.

Finding the right stack for DoorDash

There are a lot of possibilities for building server software, but for a number of reasons we only wanted to use one language. Having one language: 

  • Helps focus our teams and promotes sharing development best practices across the whole engineering organization.
  • Allows us to build common libraries that are tuned to our environments, with defaults chosen to work best at our size and continued growth. 
  • Allows engineers to change teams with minimal friction, which promotes collaboration. 

Given these characteristics, the question for us was not whether we should pursue one language but which language we should pursue. 

Picking the right coding language 

We started our coding language selection by coming up with requirements for how we wanted our services to look and operate with each other. We quickly agreed on gRPC as our mechanism for synchronous service-to-service communication, using Apache Kafka as a message queue. We already had lots of experience and expertise with Postgres and Apache Cassandra, so these would remain our data stores. These are all fairly well-established technologies with a wide array of support in all modern languages, so we had to figure out what other factors to consider.

Any technology that we chose would need to be: 

  • CPU-efficient and scalable to multiple cores
  • Easy to monitor 
  • Supported by a strong library ecosystem, allowing us to focus on business problems
  • Able to ensure good developer productivity 
  • Reliable at scale
  • Future-proofed, able to support our business growth 

We compared languages with these requirements in mind. We discarded major languages, including C++, Ruby, PHP, and Scala, that would not support growth in queries per second (QPS) and headcount. Although these are all fine languages, they lack one or more of the core tenets we were looking for in our future language stack. Given these considerations, the landscape was limited to Kotlin, Java, Go, Rust, and Python 3. With these as the competitors, we created the chart below to help us compare and contrast the strengths and weaknesses of each option.

Comparing our language options

Language: Kotlin
Pros:
– Provides a strong library ecosystem
– Provides first class support for gRPC, HTTP, Kafka, Cassandra, and SQL
– Inherits the Java ecosystem
– Is fast and scalable
– Has native primitives for concurrency
– Eases the verbosity of Java and removes the need for complex Builder/Factory patterns
– Java agents provide powerful automatic introspection of components with little code, automatically defining and exporting metrics and traces to monitoring solutions
Cons:
– Is not commonly used on the server side, meaning there are fewer samples and examples for our developers to use
– Concurrency isn’t as trivial as in Go, which integrates the core ideas of goroutines at the base layer of the language and its standard library

Language: Java
Pros:
– Provides a strong library ecosystem
– Provides first class support for gRPC, HTTP, Kafka, Cassandra, and SQL
– Is fast and scalable
– Java agents provide powerful automatic introspection of components with little code, automatically defining and exporting metrics and traces to monitoring solutions
Cons:
– Concurrency is harder than in Kotlin or Go (callback hell)
– Can be extremely verbose, making it harder to write clean code

Language: Go
Pros:
– Provides a strong library ecosystem
– Provides first class support for gRPC, HTTP, Kafka, Cassandra, and SQL
– Is a fast and scalable option
– Has native primitives for concurrency, which make writing concurrent code simpler
– Lots of server side examples and documentation are available
Cons:
– Configuring the data model can be hard for people unfamiliar with the language
– No generics (but finally coming!) means certain classes of libraries are much harder to build in Go

Language: Rust
Pros:
– Very fast to run
– Has no garbage collection but is still memory and concurrency-safe
– Lots of investment and exciting developments as large companies begin adopting the language
– Powerful type system that can express complex ideas and patterns more easily than other languages
Cons:
– Relatively new, which means fewer samples, libraries, or developers with experience building patterns and debugging
– Ecosystem not as strong as others
– async/await was not standardized at the time
– Memory model takes time to learn

Language: Python 3
Pros:
– Provides a strong library ecosystem
– Easy to use
– There was already a lot of experience on the team
– Often easy to hire for
– Has first class support for gRPC, HTTP, Cassandra, and SQL
– Has a REPL for easy testing and debugging of live apps
Cons:
– Runs slowly compared to most options
– The global interpreter lock makes it difficult to fully utilize our multicore machines effectively
– Does not have a strong type checking feature
– Kafka support can be spotty at times and there are lags in features

Given this comparison, we settled on developing a golden standard of Kotlin components we had tested and scaled, essentially giving us a better version of Java while mitigating the pain points. Therefore, Kotlin was our choice; we just had to work around some growing pains.

What went well: Kotlin’s benefits over Java

One of Kotlin’s best benefits over Java is null safety. Having to explicitly declare nullable objects, and the language forcing us to deal with them in a safe manner, removes a lot of potential runtime exceptions we would otherwise have to deal with. We also gain the safe call operator, ?., which allows single-line, safe access to nullable subfields, and the Elvis (null coalescing) operator, ?:, which supplies a default when the result is null.

In Java:

int subLength = 0;
if (obj != null) {
  if (obj.subObj != null) {
    subLength = obj.subObj.length();
  }
}

In Kotlin this becomes:

val subLength = obj?.subObj?.length() ?: 0

While the above is an extremely simple example, the power behind this operator drastically reduces the number of conditional statements in our code and makes it easier to read.
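To make the one-liner above concrete, here is a minimal, self-contained sketch. The `Obj` and `SubObj` classes are hypothetical stand-ins for the types in the snippet, not code from our services.

```kotlin
// Hypothetical types standing in for `obj` and `subObj` in the snippet above.
data class SubObj(val text: String) {
    fun length(): Int = text.length
}

data class Obj(val subObj: SubObj?)

// `?.` short-circuits to null as soon as any receiver in the chain is null;
// `?:` then supplies the fallback value.
fun subLength(obj: Obj?): Int = obj?.subObj?.length() ?: 0

fun main() {
    println(subLength(null))                   // 0
    println(subLength(Obj(null)))              // 0
    println(subLength(Obj(SubObj("dasher"))))  // 6
}
```

Because `subObj` is declared as `SubObj?`, writing `obj.subObj.length()` without the safe calls would be rejected by the compiler rather than failing at runtime.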

As we migrate to Prometheus, an event monitoring system, instrumenting our services with metrics is easier in Kotlin than in other languages. We developed an annotation processor that automatically generates per-metric functions, ensuring the right number of tags in the correct order.

A standard Prometheus library integration looks something like:

// to declare
val SuccessfulRequests = Counter.build( 
    "successful_requests",
    "successful proxying of requests",
)
.labelNames("handler", "method", "regex", "downstream")
.register()

// to use
SuccessfulRequests.labels("handlerName", "GET", ".*", "www.google.com").inc()

We are able to change this to a much less error-prone API using the following code:

// to declare
@PromMetric(
  PromMetricType.Counter, 
  "successful_requests", 
  "successful proxying of requests", 
  ["handler", "method", "regex", "downstream"])
object SuccessfulRequests

// to use
SuccessfulRequests("handlerName", "GET", ".*", "www.google.com").inc()

With this integration we don’t need to remember the order or number of labels a metric has, as the compiler and our IDE ensure the correct number and let us know the name of each label. As we adopt distributed tracing, the integration is as simple as adding a Java agent at runtime. This allows our observability and infrastructure teams to quickly roll out distributed tracing to new services without requiring code changes from the owning teams.
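The generated code itself is not shown here, so the following is only a sketch of the general idea: one named parameter per label, so that a missing or extra label becomes a compile error. `FakeCounter` and `LabeledCounter` are hypothetical stand-ins written for this example; they are not the Prometheus client library, and the real processor output may look quite different.

```kotlin
// Stand-in for a Prometheus counter; NOT the real client library.
class FakeCounter {
    private val counts = mutableMapOf<List<String>, Double>()

    fun inc(labelValues: List<String>) {
        counts[labelValues] = (counts[labelValues] ?: 0.0) + 1.0
    }

    fun get(labelValues: List<String>): Double = counts[labelValues] ?: 0.0
}

// Binds one concrete set of label values to the counter.
class LabeledCounter(private val counter: FakeCounter, private val labelValues: List<String>) {
    fun inc() = counter.inc(labelValues)
}

// Hypothetical shape of what a processor could generate for the metric above:
// the label names become parameters, so arity and order are checked by the compiler.
object SuccessfulRequests {
    val counter = FakeCounter()

    operator fun invoke(handler: String, method: String, regex: String, downstream: String) =
        LabeledCounter(counter, listOf(handler, method, regex, downstream))
}

fun main() {
    SuccessfulRequests("handlerName", "GET", ".*", "www.google.com").inc()
    // Named arguments make the label names visible at the call site.
    SuccessfulRequests(handler = "handlerName", method = "GET", regex = ".*", downstream = "www.google.com").inc()
    println(SuccessfulRequests.counter.get(listOf("handlerName", "GET", ".*", "www.google.com"))) // 2.0
}
```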

Coroutines have also become extremely powerful for us. This pattern lets developers write code closer to the imperative style they are accustomed to without getting stuck in callback hell. Coroutines are also easy to combine and run in parallel when necessary. An example from one of our Kafka consumers is:

val awaiting = msgs.partitions().map { topicPartition ->
   async {
       val records = msgs.records(topicPartition)
       val processor = processors[topicPartition.topic()]
       if (processor == null) {
           logger.withValues(
               Pair("topic", topicPartition.topic()),
           ).error("No processor configured for topic for which we have received messages")
       } else {
           try {
               processRecords(records, processor)
           } catch (e: Exception) {
               logger.withValues(
                   Pair("topic", topicPartition.topic()),
                   Pair("partition", topicPartition.partition()),
               ).error("Failed to process and commit a batch of records")
           }
       }
   }
}
awaiting.awaitAll()

Kotlin’s coroutines allow us to quickly split the messages by partition and fire off a coroutine per partition to process the messages without violating the ordering of the messages as they were inserted into the queue. Afterwards, we join all the futures before checkpointing our offsets back to the brokers.
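The same fan-out and join shape can be tried in isolation. The sketch below assumes the kotlinx.coroutines library is on the classpath; the `Record` type and `processBatch` function are hypothetical and stand in for Kafka's consumer records and our real processing code.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking

// Hypothetical record type; a real consumer would use Kafka's ConsumerRecord.
data class Record(val partition: Int, val offset: Long, val value: String)

// Starts one coroutine per partition. Records inside a partition are handled
// sequentially by that coroutine, so their relative order is preserved.
suspend fun processBatch(records: List<Record>): Map<Int, List<Long>> = coroutineScope {
    records.groupBy { it.partition }
        .map { (partition, partitionRecords) ->
            async {
                // Stand-in for real work: report the offsets in the order handled.
                partition to partitionRecords.map { it.offset }
            }
        }
        .awaitAll() // join every partition before offsets would be committed
        .toMap()
}

fun main() = runBlocking {
    val batch = listOf(Record(0, 0, "a"), Record(1, 0, "b"), Record(0, 1, "c"))
    println(processBatch(batch)) // {0=[0, 1], 1=[0]}
}
```

Using `coroutineScope` ties the per-partition coroutines to the batch, so the function only returns once every partition has finished or one of them has failed.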

These are just a few examples of the ease with which Kotlin allows us to move fast while doing so in a reliable and scalable manner.

Kotlin’s growing pains

To fully utilize Kotlin we had to overcome the following issues: 

  • Educating our team in how to use this language effectively
  • Developing best practices for using coroutines 
  • Getting around Java interoperability pain points
  • Making dependency management easier

We will address how we dealt with each of these issues in the following sections in greater detail.

Teaching Kotlin to our team

One of the biggest issues around adopting Kotlin was ensuring that we could get our team up to speed on using it. Most of us had a strong background in Python, with some Java and Ruby experience on backend teams. Kotlin is not often used for backend development, so we had to come up with good guidelines to teach our backend developers how to use the language. 

Although many of these learnings can be found online, much of the online community around Kotlin is specific to Android development. Senior engineering staff wrote a “How to program in Kotlin” guide with suggestions and code snippets. We hosted Lunch and Learn sessions teaching developers how to avoid common pitfalls and effectively use the IntelliJ IDE to do their work.

We taught our engineers some of the more functional aspects of Kotlin, including how to use pattern matching and prefer immutability by default. We also set up Slack channels where people could come to ask questions and get advice, building a community for Kotlin engineering mentorship. Through all of these efforts we were able to build up a strong base of engineers fluent in Kotlin who could help teach new hires as we increased headcount, building a self-sustaining cycle that continually improved our organization.

Avoiding coroutines gotchas

gRPC was our method of choice for service-to-service communication, but gRPC-Java, the only choice for Kotlin gRPC services at the time, lacked support for coroutines because coroutines don’t exist in Java. We needed to close that gap to take full advantage of Kotlin. Two open source projects, Kroto-plus and Protokruft, were working to help resolve this situation. We ended up using a bit of both to design our services and create a more native-feeling solution. Recently, gRPC-Kotlin became generally available and we are already well underway migrating services to use the official bindings for the best experience building systems in Kotlin.

Other gotchas with coroutines will be familiar to Android developers who have made the switch. Don’t reuse CoroutineContexts across requests. A cancellation or exception can put the CoroutineContext into a cancelled state, which means any further attempts to launch coroutines on that context will fail. As such, a new CoroutineContext should be created for each request a server is handling. ThreadLocal variables can no longer be relied upon, as coroutines can be swapped in and out across threads, leading to incorrect or overwritten data. Finally, avoid using GlobalScope to launch coroutines, as it is unbounded and therefore can lead to resource issues.
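The first two gotchas can be reproduced in a few lines. This is a sketch that assumes kotlinx.coroutines is available; `runsNewWork`, `newRequestScope`, `requestId`, and `readRequestId` are hypothetical names invented for the example. It also shows `asContextElement`, a kotlinx.coroutines helper that is one possible way to carry a ThreadLocal value along with a coroutine.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.Job
import kotlinx.coroutines.asContextElement
import kotlinx.coroutines.cancel
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Returns true if a coroutine launched on `scope` actually ran its body.
fun runsNewWork(scope: CoroutineScope): Boolean = runBlocking {
    val ran = AtomicBoolean(false)
    scope.launch { ran.set(true) }.join()
    ran.get()
}

// Hypothetical per-request factory: every request gets its own Job.
fun newRequestScope(): CoroutineScope = CoroutineScope(Job() + Dispatchers.Default)

// Hypothetical request-scoped value that would traditionally live in a ThreadLocal.
val requestId = ThreadLocal<String?>()

// asContextElement installs the value on whichever thread runs the coroutine.
fun readRequestId(id: String): String? = runBlocking {
    withContext(Dispatchers.Default + requestId.asContextElement(id)) {
        requestId.get()
    }
}

fun main() {
    val shared = newRequestScope()
    shared.cancel() // e.g. an earlier request on this scope was cancelled

    println(runsNewWork(shared))            // false: the scope stays cancelled
    println(runsNewWork(newRequestScope())) // true: a fresh scope is unaffected
    println(readRequestId("req-42"))        // req-42
}
```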

Resolving Java’s phantom NIO problem

After choosing Kotlin, we found that many libraries claiming to implement modern Java Non-blocking I/O (NIO) standards (and hence would interoperate with Kotlin coroutines quite nicely) do so in an unscalable manner. Rather than implementing the underlying protocol and standards based upon the NIO primitives, they instead use thread pools to wrap blocking I/O. 

The side effect of this strategy is that the thread pool is quite easy to exhaust in a coroutine world, which leads to high peak latencies because every call blocks a pool thread until it completes. Most of these phantom NIO libraries will expose tuning for their thread pools so it’s possible to ensure they are large enough to satisfy the team’s requirements, but this places an increased burden on developers to tune them appropriately in order to conserve resources. Using a real NIO or Kotlin native library generally leads to better performance, easier scaling, and a better developer workflow.
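The effect is easy to see in a toy benchmark. The sketch below assumes kotlinx.coroutines is available and uses `Thread.sleep` and `delay` as stand-ins for blocking and non-blocking I/O respectively; `phantomCall`, `nonBlockingCall`, and `timeCalls` are hypothetical names, and the pool size of 2 is deliberately small to make exhaustion obvious.

```kotlin
import java.util.concurrent.Executors
import kotlin.system.measureTimeMillis
import kotlinx.coroutines.CoroutineDispatcher
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// "Phantom NIO": looks asynchronous to the caller, but parks a pool thread per call.
suspend fun phantomCall(pool: CoroutineDispatcher) {
    withContext(pool) { Thread.sleep(100) }
}

// Genuinely non-blocking stand-in: suspends without holding any thread.
suspend fun nonBlockingCall() {
    delay(100)
}

// Runs `n` calls concurrently and returns the wall-clock time in milliseconds.
fun timeCalls(n: Int, call: suspend () -> Unit): Long = measureTimeMillis {
    runBlocking { List(n) { async { call() } }.awaitAll() }
}

fun main() {
    Executors.newFixedThreadPool(2).asCoroutineDispatcher().use { pool ->
        // 8 calls over 2 threads must queue: at least 4 rounds of 100 ms each.
        val phantom = timeCalls(8) { phantomCall(pool) }
        // 8 suspending calls overlap, so the total stays close to a single delay.
        val real = timeCalls(8) { nonBlockingCall() }
        println("thread-pool wrapper: ${phantom} ms, suspending: ${real} ms")
    }
}
```

Raising the pool size hides the problem in this toy case, which is exactly the tuning burden described above.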

Dependency management: using Gradle is challenging

For both newcomers and those experienced in the Java/JVM ecosystem, the build system and dependency management are a lot less intuitive than some more recent solutions like Rust’s Cargo or Go’s modules. In particular, some dependencies we have, direct or indirect, are particularly sensitive to version upgrades. Projects like Kafka and Scala don’t follow semantic versioning, which can lead to issues where compilation succeeds, but the app fails on bootup with odd, seemingly irrelevant backtraces.

As time has passed, we’ve learned which projects tend to cause these issues most often and have examples of how to catch and bypass them. Gradle in particular has some helpful pages on how to view the dependency tree, which is always useful in these situations. Learning the ins and outs of multi-project repos can take some time, and it’s easy to end up with conflicting requirements and circular dependencies.

Planning the layout of multi-project repos ahead of time greatly benefits projects in the long run. Always try to make dependencies a simple tree. Having a base that doesn’t depend on any of the subprojects (and never does) and then building on top of it recursively should prevent hard-to-debug or detangle dependency chains. DoorDash also makes heavy use of Artifactory, allowing us to easily share libraries across repositories.
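As an illustration of the simple-tree layout, here is a sketch in the Gradle Kotlin DSL. The project names (`logistics`, `base`, `client`, `service`) are hypothetical, and each build file is assumed to apply a JVM plugin (for example the Kotlin JVM plugin) so that the `implementation` configuration exists.

```kotlin
// settings.gradle.kts (hypothetical project names)
rootProject.name = "logistics"
include(":base", ":client", ":service")

// service/build.gradle.kts: the top of the tree depends downward only.
dependencies {
    implementation(project(":client"))
    implementation(project(":base"))
}

// client/build.gradle.kts: depends only on the base.
dependencies {
    implementation(project(":base"))
}

// base/build.gradle.kts declares no project(...) dependencies at all,
// so nothing can ever point back up the tree and form a cycle.
```

With a layout like this, `./gradlew :service:dependencies --configuration runtimeClasspath` prints the resolved dependency tree for one subproject, and `./gradlew :service:dependencyInsight --dependency kafka-clients` (the artifact name here is only an example) shows why a particular version of a transitive dependency was selected.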

The future of Kotlin at DoorDash

We continue to be all in on Kotlin as the standard for services at DoorDash. Our Kotlin Platform team has been hard at work building a next generation service standard (built on top of Guice and Armeria) to help ease development by coming prewired with tools and utilities including monitoring, distributed tracing, exception tracking, integrations with our runtime configuration management tooling, and security integrations.

These efforts will help us develop code that is more shareable and help ease the developer burden of finding dependencies that work together and keeping them all up to date. The investment of building such a system is already showing dividends in how quickly we can spin up new services when the need arises. Kotlin allows our developers to focus on their business use cases and spend less time writing the boilerplate code they would end up with in a pure Java ecosystem. Overall we are extremely happy with our choice of Kotlin and look forward to continued improvements to the language and ecosystem.

Given our experiences, we can strongly recommend backend engineers consider Kotlin as their primary language. The idea of Kotlin as a better Java proved true for DoorDash, as it brings greater developer productivity and a reduction in errors found at runtime. These advantages allow our teams to focus on solving their business needs, increasing their agility and velocity. We continue to invest in Kotlin as our future, and hope to continue to collaborate with the larger ecosystem to develop an even stronger case for Kotlin as a primary language for server development.