When DoorDash approached the limits of what our Django-based monolithic codebase could support, we needed to design a new stack that would provide a strong foundation for our logistics services. This new platform would need to support our future growth and enable our team to build using better patterns going forward.
Under our legacy system, the number of nodes that needed to be updated added significant time to releases. Bisecting bad deploys (finding out which commit or commits were causing issues) got harder and longer due to the number of commits each deploy had. On top of that, our monolith was built on old versions of Python 2 and Django, which were rapidly entering end-of-life for security support.
We needed to break parts off of our monolith, allowing our systems to scale better, and decide how we wanted our new services to look and behave. Finding a tech stack that would support this effort was the first step in the process. After surveying a number of different languages, we chose Kotlin for its rich ecosystem, interoperability with Java, and developer friendliness. However, we needed to make some changes to handle its growing pains.
Finding the right stack for DoorDash
There are a lot of possibilities for building server software, but for a number of reasons we only wanted to use one language. Having one language:
- Helps focus our teams and promotes sharing development best practices across the whole engineering organization.
- Allows us to build common libraries that are tuned to our environments, with defaults chosen to work best at our size and continued growth.
- Allows engineers to change teams with minimal friction, which promotes collaboration.
Given these characteristics, the question for us was not whether we should pursue one language but which language we should pursue.
Picking the right coding language
We started our coding language selection by coming up with requirements for how we wanted our services to look and operate with each other. We quickly agreed on gRPC as our mechanism for synchronous service-to-service communication, using Apache Kafka as a message queue. We already had lots of experience and expertise with Postgres and Apache Cassandra, so these would remain our data stores. These are all fairly well-established technologies with a wide array of support in all modern languages, so we had to figure out what other factors to consider.
Any technology that we chose would need to be:
- CPU-efficient and scalable to multiple cores
- Easy to monitor
- Supported by a strong library ecosystem, allowing us to focus on business problems
- Able to ensure good developer productivity
- Reliable at scale
- Future-proofed, able to support our business growth
We compared languages with these requirements in mind. We ruled out several major languages, including C++, Ruby, PHP, and Scala, that we judged would not support our growth in queries per second (QPS) and engineering headcount. Although these are all fine languages, each lacked one or more of the core tenets we were looking for in our future language stack. That narrowed the field to Kotlin, Java, Go, Rust, and Python 3. With these as the contenders, we created the chart below to compare and contrast the strengths and weaknesses of each option.
Comparing our language options
Language | Pros | Cons |
---|---|---|
Kotlin | - Provides a strong library ecosystem - Provides first class support for gRPC, HTTP, Kafka, Cassandra, and SQL - Inherits the Java ecosystem - Is fast and scalable - Has native primitives for concurrency - Eases the verbosity of Java and removes the need for complex Builder/Factory patterns - Java agents provide powerful automatic introspection of components with little code, automatically defining and exporting metrics and traces to monitoring solutions | - Is not commonly used on the server side, meaning there are fewer samples and examples for our developers to use - Concurrency isn’t as trivial as in Go, which integrates the core ideas of goroutines at the base layer of the language and its standard library |
Java | - Provides a strong library ecosystem - Provides first class support for gRPC, HTTP, Kafka, Cassandra, and SQL - Is fast and scalable - Java agents provide powerful automatic introspection of components with little code, automatically defining and exporting metrics and traces to monitoring solutions | - Concurrency is harder than in Kotlin or Go (callback hell) - Can be extremely verbose, making it harder to write clean code |
Go | - Provides a strong library ecosystem - Provides first class support for gRPC, HTTP, Kafka, Cassandra, and SQL - Is a fast and scalable option - Has native primitives for concurrency, which make writing concurrent code simpler - Lots of server side examples and documentation are available | - Configuring the data model can be hard for people unfamiliar with the language - No generics (but finally coming!), which means certain classes of libraries are much harder to build in Go |
Rust | - Very fast to run - Has no garbage collection but is still memory- and concurrency-safe - Lots of investment and exciting developments as large companies begin adopting the language - Powerful type system that can express complex ideas and patterns more easily than other languages | - Relatively new, which means fewer samples, libraries, or developers with experience building patterns and debugging - Ecosystem not as strong as others - async/await was not standardized at the time - Memory model takes time to learn |
Python 3 | - Provides a strong library ecosystem - Easy to use - There was already a lot of experience on the team - Often easy to hire for - Has first class support for gRPC, HTTP, Cassandra, and SQL - Has a REPL for easy testing and debugging of live apps | - Runs slowly compared to most options - The global interpreter lock makes it difficult to fully utilize our multicore machines - Does not have strong type checking - Kafka support can be spotty at times and there are lags in features |
Given this comparison, Kotlin was our choice: essentially a better version of Java with its pain points mitigated. We committed to developing a golden standard of Kotlin components that we had tested and scaled; we just had to work around some growing pains.
What went well: Kotlin’s benefits over Java
One of Kotlin’s biggest benefits over Java is null safety. Having to explicitly declare nullable objects, with the language forcing us to handle them safely, removes a lot of potential runtime exceptions we would otherwise have to deal with. We also gain the safe-call operator, ?., which allows single-line, safe access to nullable subfields.
In Java:
int subLength = 0;
if (obj != null) {
    if (obj.subObj != null) {
        subLength = obj.subObj.length();
    }
}
In Kotlin this becomes:
val subLength = obj?.subObj?.length() ?: 0
While the above is an extremely simple example, the power behind this operator drastically reduces the number of conditional statements in our code and makes it easier to read.
As we migrate to Prometheus, an event monitoring system, instrumenting our services with metrics is easier in Kotlin than in other languages. We developed an annotation processor that automatically generates per-metric functions, ensuring the right number of tags in the correct order.
A standard Prometheus library integration looks something like:
// to declare
val SuccessfulRequests = Counter.build(
"successful_requests",
"successful proxying of requests",
)
.labelNames("handler", "method", "regex", "downstream")
.register()
// to use
SuccessfulRequests.labels("handlerName", "GET", ".*", "www.google.com").inc()
We are able to change this to a much less error-prone API using the following code:
// to declare
@PromMetric(
PromMetricType.Counter,
"successful_requests",
"successful proxying of requests",
["handler", "method", "regex", "downstream"])
object SuccessfulRequests
// to use
SuccessfulRequests("handlerName", "GET", ".*", "www.google.com").inc()
With this integration we don’t need to remember the order or number of labels a metric has, as the compiler and our IDE ensure the correct number and lets us know the name of each label. As we adopt distributed tracing, the integration is as simple as adding a Java agent at runtime. This allows our observability and infrastructure teams to quickly roll out distributed tracing to new services without requiring code changes from the owning teams.
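Wiring a service for tracing this way is just a launch-flag change. The post doesn’t name the specific agent we use, so the OpenTelemetry Java agent below is only an illustration (the jar path and service name are hypothetical):

```shell
# Attach a tracing Java agent at JVM startup; the owning team's code is unchanged.
# Agent jar location and system properties are illustrative, not our actual setup.
java -javaagent:/opt/agents/opentelemetry-javaagent.jar \
     -Dotel.service.name=delivery-service \
     -jar delivery-service.jar
```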
Coroutines have also become extremely powerful for us. This pattern lets developers write code closer to the imperative style they are accustomed to without getting stuck in callback hell. Coroutines are also easy to combine and run in parallel when necessary. Here is an example from one of our Kafka consumers:
val awaiting = msgs.partitions().map { topicPartition ->
async {
val records = msgs.records(topicPartition)
val processor = processors[topicPartition.topic()]
if (processor == null) {
logger.withValues(
Pair("topic", topicPartition.topic()),
).error("No processor configured for topic for which we have received messages")
} else {
try {
processRecords(records, processor)
} catch (e: Exception) {
logger.withValues(
Pair("topic", topicPartition.topic()),
Pair("partition", topicPartition.partition()),
).error("Failed to process and commit a batch of records")
}
}
}
}
awaiting.awaitAll()
Kotlin’s coroutines allow us to quickly split the messages by partition and fire off a coroutine per partition to process the messages without violating the ordering of the messages as they were inserted into the queue. Afterwards, we join all the futures before checkpointing our offsets back to the brokers.
These are just a few examples of the ease in which Kotlin allows us to move fast while doing so in a reliable and scalable manner.
Kotlin’s growing pains
To fully utilize Kotlin we had to overcome the following issues:
- Educating our team in how to use this language effectively
- Developing best practices for using coroutines
- Getting around Java interoperability pain points
- Making dependency management easier
We will address how we dealt with each of these issues in the following sections in greater detail.
Teaching Kotlin to our team
One of the biggest issues around adopting Kotlin was ensuring that we could get our team up to speed on using it. Most of us had a strong background in Python, with some Java and Ruby experience on backend teams. Kotlin is not often used for backend development, so we had to come up with good guidelines to teach our backend developers how to use the language.
Although many of these learnings can be found online, much of the online community around Kotlin is specific to Android development. Senior engineering staff wrote a “How to program in Kotlin” guide with suggestions and code snippets. We hosted Lunch and Learn sessions teaching developers how to avoid common pitfalls and how to use the IntelliJ IDE effectively.
We taught our engineers some of the more functional aspects of Kotlin and how to use pattern matching and prefer immutability by default. We also set up Slack channels where people could come to ask questions and get advice, building a community for Kotlin engineering mentorship. Through all of these efforts we were able to build up a strong base of engineers fluent in Kotlin that could help teach new hires as we increased headcount, building a self-sustaining cycle that continually improved our organization.
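To make that guidance concrete, here is a minimal sketch in the style we encouraged: immutable data classes and an exhaustive `when` in place of mutable state and if/else chains. The domain types are hypothetical, not from our codebase.

```kotlin
// Illustrative names only; DeliveryEvent and its subtypes are hypothetical.
sealed class DeliveryEvent
data class Assigned(val dasherId: Long) : DeliveryEvent()
data class Delivered(val at: String) : DeliveryEvent()
object Cancelled : DeliveryEvent()

// Exhaustive `when`: the compiler rejects this expression if a subtype is unhandled.
fun describe(e: DeliveryEvent): String = when (e) {
    is Assigned -> "assigned to dasher ${e.dasherId}"
    is Delivered -> "delivered at ${e.at}"
    Cancelled -> "cancelled"
}

fun main() {
    println(describe(Assigned(42)))  // prints: assigned to dasher 42
}
```

Because the event type is sealed, adding a new subtype later turns every non-exhaustive `when` into a compile error rather than a silent runtime gap.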
Avoiding coroutines gotchas
gRPC was our method of choice for service-to-service communication, but at the time it lacked coroutine support, which we needed to rectify to take full advantage of Kotlin. gRPC-Java was the only option for Kotlin gRPC services, and it lacked coroutines because those don’t exist in Java. Two open source projects, Kroto-plus and Protokruft, were working to resolve this situation; we ended up using a bit of both to design our services and create a more native-feeling solution. Recently, gRPC-Kotlin became generally available, and we are already well underway migrating services to the official bindings for the best experience building systems in Kotlin.
Other coroutine gotchas will be familiar to Android developers who have made the switch. Don’t reuse CoroutineContexts across requests: a cancellation or exception can put a CoroutineContext into a cancelled state, after which any further attempts to launch coroutines on that context will fail. Instead, create a new CoroutineContext for each request the server handles. ThreadLocal variables can no longer be relied upon, as coroutines can be swapped in and out, leading to incorrect or overwritten data. Finally, avoid launching coroutines in GlobalScope: it is unbounded and can therefore lead to resource issues.
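A minimal sketch of the per-request pattern, assuming kotlinx.coroutines (scope and handler names are illustrative): a SupervisorJob plus a CoroutineExceptionHandler keeps one failed coroutine from cancelling its siblings or poisoning a shared scope.

```kotlin
import kotlinx.coroutines.*

// Hedged sketch: one scope per request, built on a SupervisorJob so a failed
// child neither cancels siblings nor leaves the scope in a cancelled state.
fun main() = runBlocking {
    val handler = CoroutineExceptionHandler { _, e ->
        println("handled: ${e.message}")
    }
    // Never GlobalScope, never a reused (possibly cancelled) context.
    val requestScope = CoroutineScope(SupervisorJob() + handler)

    val healthy = requestScope.async { "ok" }
    requestScope.launch { error("boom") }.join()  // failure is contained

    println(healthy.await())  // the sibling still completes and prints: ok
    requestScope.cancel()     // release the scope when the request finishes
}
```

Had the scope used a plain Job instead of a SupervisorJob, the failed launch would have cancelled `healthy` as well.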
Resolving Java’s phantom NIO problem
After choosing Kotlin, we found that many libraries claiming to implement modern Java Non-blocking I/O (NIO) standards (and hence would interoperate with Kotlin coroutines quite nicely) do so in an unscalable manner. Rather than implementing the underlying protocol and standards based upon the NIO primitives, they instead use thread pools to wrap blocking I/O.
The side effect of this strategy is the thread pool is quite easy to exhaust in a coroutine world, which leads to high peak latencies due to their blocking nature. Most of these phantom NIO libraries will expose tuning for their thread pools so it’s possible to ensure they are large enough to satisfy the team’s requirements, but this places increased burden on developers to tune them appropriately in order to conserve resources. Using a real NIO or Kotlin native library generally leads to better performance, easier scaling, and a better developer workflow.
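The exhaustion effect is easy to demonstrate. A small sketch, assuming kotlinx.coroutines, with pool size and timings purely illustrative:

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors

// Hedged sketch of the failure mode: a "phantom NIO" client hides blocking
// I/O behind a small thread pool. Launch more concurrent callers than there
// are threads and later callers queue for a free thread, inflating latency.
fun main() = runBlocking {
    val pool = Executors.newFixedThreadPool(2).asCoroutineDispatcher()

    // Stands in for a library call that blocks a pool thread for ~100 ms.
    suspend fun phantomNioCall() = withContext(pool) { Thread.sleep(100) }

    val start = System.currentTimeMillis()
    // 8 concurrent "requests" against 2 threads: roughly 4 waves of work,
    // so the batch takes ~400 ms even though each call is only ~100 ms.
    (1..8).map { async { phantomNioCall() } }.awaitAll()
    println("batch took ${System.currentTimeMillis() - start} ms")

    pool.close()
}
```

A true NIO implementation would suspend the coroutine while the socket waits, leaving the threads free, so the same eight calls would finish in roughly the time of one.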
Dependency management: using Gradle is challenging
For newcomers, and even for those experienced in the Java/JVM ecosystem, the build system and dependency management are a lot less intuitive than more recent solutions like Rust’s Cargo or Go’s modules. In particular, some of our dependencies, direct or indirect, are particularly sensitive to version upgrades. Projects like Kafka and Scala don’t follow semantic versioning, which can lead to issues where compilation succeeds but the app fails on bootup with odd, seemingly irrelevant backtraces.
As time has passed, we’ve learned which projects tend to cause these issues most often and have examples of how to catch and bypass them. Gradle in particular has some helpful pages on how to view the dependency tree, which is always useful in these situations. Learning the ins and outs of multi-project repos can take some time, and it’s easy to end up with conflicting requirements and circular dependencies.
Planning the layout of multi-project repos ahead of time greatly benefits projects in the long run. Always try to make dependencies a simple tree. Having a base that doesn’t depend on any of the subprojects (and never does) and then building on top of it recursively should prevent hard-to-debug or detangle dependency chains. DoorDash also makes heavy use of Artifactory, allowing us to easily share libraries across repositories.
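As one way to encode that rule, here is a hypothetical Gradle Kotlin DSL layout (project names are invented for illustration, not taken from our actual repositories):

```kotlin
// settings.gradle.kts — hypothetical multi-project layout
rootProject.name = "logistics"
include(":base", ":grpc-clients", ":services:dispatch")

// services/dispatch/build.gradle.kts — leaf projects depend only downward;
// :base never depends on another subproject, so the dependency graph stays
// a simple tree and circular dependencies cannot form.
dependencies {
    implementation(project(":base"))
    implementation(project(":grpc-clients"))
}
```

When versions do conflict, `./gradlew :services:dispatch:dependencies` prints the resolved dependency tree for a single project, which makes bisecting the offending upgrade much faster.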
The future of Kotlin at DoorDash
We continue to be all in on Kotlin as the standard for services at DoorDash. Our Kotlin Platform team has been hard at work building a next generation service standard (built on top of Guice and Armeria) to help ease development by coming prewired with tools and utilities including monitoring, distributed tracing, exception tracking, integrations with our runtime configuration management tooling, and security integrations.
These efforts will help us develop code that is more shareable and help ease the developer burden of finding dependencies that work together and keeping them all up to date. The investment of building such a system is already showing dividends in how quickly we can spin up new services when the need arises. Kotlin allows our developers to focus on their business use cases and spend less time writing the boilerplate code they would end up with in a pure Java ecosystem. Overall we are extremely happy with our choice of Kotlin and look forward to continued improvements to the language and ecosystem.
Given our experiences, we can strongly recommend backend engineers consider Kotlin as their primary language. The idea of Kotlin as a better Java proved true for DoorDash, as it brings greater developer productivity and a reduction in errors found at runtime. These advantages allow our teams to focus on solving their business needs, increasing their agility and velocity. We continue to invest in Kotlin as our future, and we hope to keep collaborating with the larger ecosystem to develop an even stronger case for Kotlin as a primary language for server development.
While Python 3 has a strong library ecosystem, the whole ecosystem itself is weak
Pip is lacking a lot when it comes to dependency management and there is no tool that really fills the void (conda, poetry, pipenv, pip-tools)
Same for building and packaging tools
Kotlin has a REPL
>> A cancellation or exception can put the CoroutineContext into a cancelled state, which means any further attempts to launch coroutines on that context will fail. As such, for each request a server is handling, a new CoroutineContext should be created.
This is even worse. New contexts are quite expensive to create each time. What you need is a special type of job that doesn’t cancel after an exception + a good coroutine exception handler.
How did you replace the django admin functionality?
It’s awesome to see Kotlin getting adopted for backend development, thanks for sharing your experience!
If you don’t mind me asking, how specifically did you solve the problem of non-blocking I/O? Did you end up using any libraries out there or did you roll your own? What kind of I/O resources were you consuming the most: network calls, file systems, message brokers?
Also, by any chance are you also using Kotlin in other areas like stream processing?
Both Java and Kotlin have a REPL; Java got its REPL in Java 9. Good choice on Kotlin.
Great article, thanks for the insight.
Also, it would be great to hear how DoorDash managed to solve CI/CD after migrating from Python to Kotlin; that must have been a challenge for sure. Looking forward to it 🙂
We had a team focused on building our own replacement, and using GRPC calls to the appropriate services to make the updates.
I will ask our release team if they want to blog about the layers and versions of our CI/CD system. It has evolved a lot even in the year and a half I’ve been here, and it’s still evolving and getting better with each iteration.
We do use Kotlin inside Apache Flink for stream processing. To get around the phantom nio issues, we curate a list of “golden” libraries we know implement coroutine friendly paradigms, or provide pre-tuned versions of libraries where no such alternative exists.
Would you be willing/able to share the “How to program in Kotlin” guide that you developed? That sounds like a great resource for others that are looking to use Kotlin outside the Android ecosystem.
Why haven’t you considered .NET Core C#?
I agree that Django has various problems, but I wouldn’t throw away the language just because of problems with that library and your big monolith on top. You could for example move to Flask.
You could also have explored Cython and PyPy to get more performance out of your Python code.
One way to have Python fully utilize multi-core machines is to spin up a pool of processes.
There are solutions to all your problems without getting rid of Python.
Hi Matt,
Did you start migrating to Java 9? Beginning with Java 9, a REPL is supported. Also, a lot of the advantages you wrote about can be achieved using Spring Boot and Java lambdas. Concurrency is still a pain in Java, but I’m looking forward to hearing how you tuned GC. Since you didn’t mention anything about peak volumes and how you scale your application on demand, it would be great if you covered those as well.
Kotlin has a REPL.
Can I ask why Scala was disqualified so early?
We are now on Java 11; at the time we were making this decision, we were a Java 8 shop. Yes, Kotlin has a REPL now. The REPL that existed at the time of this decision was not nearly full-featured enough to be used easily for debugging. Importing our classes so we can use the functions within to do manual testing is a must, and that has only recently become doable in an easy fashion.
Scala was disqualified because we could see little reason to include it alongside Kotlin and Java. Kotlin has more forward momentum, and its syntax is closer to what most of our developers are used to. Also, Scala breaks compatibility with each release, which means we would have to publish libraries for each Scala version in our ecosystem. Kotlin and Java are much more forgiving in that regard, and we rarely have to recompile a library with a different version of Kotlin.
I’m also curious why .net core wasn’t considered. Did you feel it is too closely tied to the Microsoft ecosystem? Lack of experience in your existing team? Thanks.
Could you share more details about why you discarded PHP?
Please point me to more articles that support the claim that Ruby, C++, PHP, and Scala don’t “support growth in queries per second (QPS) and headcount.” I don’t understand how this is true and need to read about it in order to learn.
Would it be possible to have the guidelines produced by your senior engineering staff published as well? That seems like it’s going to help Kotlin become a more mainstream backend system.
Any idea what web framework is used to replace Django?
We built our own on top of the Armeria (armeria.dev) project
Oh, interesting! Last time I was told you folks were using a mixture of Spring Boot and Micronaut. That was in 2021. Any reason for switching to Armeria? (Plus it’s so great seeing you are responding to questions after 2 years)
We chose Armeria to have greater control of our ecosystem and it natively supports HTTP+GRPC in one event loop. We use Armeria as the basis of our Asgard HTTP+GRPC Server as well as our Hermes HTTP+GRPC client. This means we don’t need to spin up a separate event loop to serve things like prometheus metrics.
Hi
Thank you for the blog posts. After reading them, I am planning to use Kotlin for microservices. However, I have a few questions in this regard.
Looks like you are using both: a. grpc for sync. communication b/w microservices. b. Kafka (which can be used for async. communication b/w microservices.) If yes, why both?
Do you recommend this architecture: Microservice –> its Operational db –> CDC using Debezium —> Analytical and other stores like S3? Looks like you are not using CDC, instead relying on Reactive streams (letting a microservice write directly to Kafka), may I know why?
Which db are you using for “single-source-of-truth”, if any?
Also read that Doordash is tracking the metrics using Prometheus in the same process that runs a microservice. But in Kubernetes, they keep them separate.
Thank you