Moving e2e testing into production with multi-tenancy for increased speed and reliability

March 3, 2022 | 13 Minute Read | Backend

Santosh Banda

Santosh Banda has been a software engineer at DoorDash since early 2020, working on the developer productivity team, where he focuses on microservice performance, shift left, and testing in production. He holds a bachelor's degree from IIT Guwahati.

Why e2e testing is not reliable in the staging environment

We noticed that e2e testing in the staging environment was becoming unreliable due to several limitations, such as the lack of observability and alerting and the lack of a good way to simulate production environment data. These limitations caused the staging environment to deviate from the production environment, which led to development flows where the staging environment was either ignored or misused.

Lack of observability and alerting over the e2e functionality in the staging environment

The staging environment was missing critical observability and alerting over the e2e functionality. Even though individual services had good test coverage, observability, and alerting, it was not sufficient to ensure that all the services worked in unison for the expected e2e behavior. Because the staging environment doesn’t impact real user functionality, and because engineering attention is constantly focused on the production environment, it became hard over time to develop such observability and alerting over the e2e functionality in the staging environment.

Lack of tooling to simulate production data in the staging environment easily

For proper e2e functionality of the staging environment, staging databases need to be populated with data that is a good simulation of the production environment data. This data simulation is also important to make the e2e behavior in the staging environment similar to the production environment. Otherwise, it is hard to gain confidence that changes verified in the staging environment will work the same way in the production environment. A naive approach of copying the production environment data into the staging environment would easily result in personally identifiable information (PII) ending up in the staging environment, violating our compliance requirements. These compliance requirements make the production environment data simulation difficult at the scale of DoorDash.


Development flows where the staging environment is either ignored or misused

Due to the above-discussed limitations, the staging environment became increasingly divergent from the production environment. Given the lack of observability over the e2e functionality in the staging environment, many teams began to ignore the staging environment and roll out features directly to the production environment. This development flow resulted in the staging environment missing critical functionality compared to the production environment. Additionally, given there are no service-level objectives (SLOs) for the staging environment, many developers ended up deploying new, unstable service versions in the staging environment for individual service testing. This development flow made the staging environment even more unsuitable for e2e testing.

Why we decided to move e2e testing into the production environment

Because the staging environment had become less beneficial for e2e testing and fixing staging involved a huge effort, we decided to move e2e testing into the production environment.

Why not fix staging? Here were our reasons:

Adding observability and alerting over the e2e functionality in the staging environment would involve an overhaul of all the services in the staging environment. Given there are hundreds of microservices in our stack, this would require huge development work and collaboration across the entire engineering organization.
There is no readily available tool to simulate the production environment data in the staging environment. Moreover, a naive approach of copying the production environment data into the staging environment would easily result in PII ending up in the staging environment, breaking our compliance requirements.
The development flow would need to be enforced so that the staging environment uses stable service versions and configurations that are the same as the production environment. Even though this is relatively easy to implement, it is a very significant cultural shift.

In contrast, we noticed the following advantages with the production environment:

The production environment has a strong focus on observability and strict service-level agreements (SLAs), both of which e2e testing use cases can rely on.
Using the production environment for e2e testing with necessary isolation would avoid data duplication between the environments and eliminate the risk of breaking compliance requirements.
The production environment has enforcements in place to use stable service versions only.

Further, we noticed that developers had been using the production environment for e2e testing based on their custom workarounds rather than maintaining the services and simulating production environment data in the staging environment. Custom workarounds included:

Allowlisting/blocklisting one's employee account to customize the e2e functionality.
Ad hoc scripts to create test users and use multiple runtime configurations to change the behavior for these test users.
Building custom logic to filter test users and data in analytics.

Due to the nonstandard nature of these workarounds, developers spent a lot of time building/maintaining them, which motivated us to move e2e testing into the production environment with a focus on production environment safety and developer velocity.

How we moved e2e testing into our production environment

Once we decided to move e2e testing into the production environment, we identified the following requirements for the new solution based on the properties of the isolated staging environment:

The staging environment is isolated and not accessible to the external world (i.e., outside VPN). We required the same property in the new solution for security reasons. This requirement implies the PII generated by e2e testing is isolated and not accessible to the outside world.
To simplify and standardize the custom workarounds, we needed a standard mechanism to customize software behavior for e2e testing in the production environment while sharing most of the existing software behavior with the production environment.

Using multi-tenancy to unlock e2e testing in the production environment

The above requirements pointed us to a solution using multi-tenancy, a concept in which the same instance of the software serves multiple groups of users, each with its own user management, data, and configuration.

Multi-tenancy is an architecture paradigm where one software application and its supporting infrastructure are designed to serve multiple customer segments, also called tenants. We addressed our e2e testing challenges in the staging environment by adopting the multi-tenancy model and introducing a new tenant named DoorTest in the production environment. The multi-tenancy model enabled us to easily isolate test data and customize software behavior for the DoorTest tenant.

How we incorporated multi-tenancy in our stack

To isolate DoorTest tenant data and customize software behavior for the DoorTest tenant, we first needed a standard way for all the services to identify the tenancy of the incoming requests.

We started by defining a convention for the tenant value to bring standardization. The tenant value has two levels. The first level, defined as L0, represents a high-level product vertical (e.g., DoorDash, DoorTest, Drive, Storefront). The second level, defined as L1, represents a further classification necessary within the given product vertical (e.g., developer sandbox). The tenant value, identified by the string <L0:L1>, is propagated across all the services through OpenTelemetry. Services can isolate data and define custom behavior based on the L0 and L1 tenant values.
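To make the convention concrete, here is a minimal sketch of how a service might parse and format the tenant string. It is illustrative only: the `Tenant` class, the `parse_tenant` helper, the `sandbox` value, and the choice to treat L1 as optional are all assumptions, and the angle brackets in <L0:L1> are read as placeholder notation rather than literal characters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tenant:
    """Hypothetical in-memory form of the <L0:L1> tenant value."""
    l0: str                   # product vertical, e.g. "DoorDash" or "DoorTest"
    l1: Optional[str] = None  # further classification (assumed optional here)

    def __str__(self) -> str:
        # Format back into the "L0:L1" wire form.
        return self.l0 if self.l1 is None else f"{self.l0}:{self.l1}"


def parse_tenant(value: str) -> Tenant:
    """Split an "L0:L1" string on the first colon."""
    l0, _, l1 = value.strip().partition(":")
    if not l0:
        raise ValueError(f"missing L0 in tenant value: {value!r}")
    return Tenant(l0=l0, l1=l1 or None)


tenant = parse_tenant("DoorTest:sandbox")
is_test_traffic = tenant.l0 == "DoorTest"
```

Keeping L0 and L1 as separate fields lets a service branch on the product vertical alone while still using the full string as a data-isolation key.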

We provided an option to set the tenant value, i.e., the <L0:L1> string, in our mobile and web apps. This tenant value is attached as an HTTP header to all DoorDash API calls. The header is transformed to baggage format at the CDN layer and propagated across all our services through OpenTelemetry. This process enables our apps to operate with DoorTest tenant users and data.
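The header-to-baggage step can be sketched roughly as follows. This is not our CDN configuration: the header name `x-tenant-id`, the baggage key `tenant-id`, and the decision to drop the original header after conversion are assumptions made for illustration. The sketch uses only the standard library and produces a W3C-style `baggage` header of comma-separated `key=value` members; in a real service, the OpenTelemetry SDK's baggage propagator would normally do the reading and forwarding.

```python
from typing import Dict, Optional
from urllib.parse import quote, unquote

TENANT_HEADER = "x-tenant-id"  # hypothetical header name set by the apps
BAGGAGE_KEY = "tenant-id"      # hypothetical baggage member key


def to_baggage(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with the tenant header moved into `baggage`."""
    out = {name.lower(): value for name, value in headers.items()}
    tenant = out.pop(TENANT_HEADER, None)
    if tenant is None:
        return out
    # Percent-encode the value; ":" is a legal baggage character, so keep it readable.
    member = f"{BAGGAGE_KEY}={quote(tenant, safe=':')}"
    existing = out.get("baggage")
    out["baggage"] = f"{existing},{member}" if existing else member
    return out


def tenant_from_baggage(baggage: str) -> Optional[str]:
    """Read the tenant value back out of a `baggage` header on the service side."""
    for member in baggage.split(","):
        key, _, value = member.strip().partition("=")
        if key.strip() == BAGGAGE_KEY:
            # Ignore any ";property" suffix attached to the member.
            return unquote(value.split(";", 1)[0].strip())
    return None


forwarded = to_baggage({"X-Tenant-Id": "DoorTest:sandbox"})
```

Because baggage travels with the trace context, each downstream service can recover the tenant without every intermediate service having to forward a custom header explicitly.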

Figure #1: Mobile client has an option in the debug panel to set L0 and L1 tenant values. Once these values are set and the app is restarted, the app will operate with DoorTest tenant users and data.

How we isolated test data and customized software behavior for the DoorTest tenant

Through the above-discussed process, all the services in our stack can identify the tenancy of the incoming requests. We took the following steps to isolate test data and customize behavior for the DoorTest tenant:

We created a new database for the DoorTest tenant and added a new column called tenant-id, which stores the string <L0:L1>, to all user-related tables. This step introduces physical isolation of the DoorTest tenant (L0) data and logical isolation of the L1 tenant’s data.
We added a query routing layer to pick the correct database. All queries were updated to add an extra filter on the tenant-id column based on the tenant information of the incoming request.
In the production environment database, we enforce a unique non-null constraint on the phone_number column, consistent with real-world behavior. We relaxed this constraint for the DoorTest database by making the phone_number column nullable. This relaxation simplified the creation of test users since we don’t need to find unique phone numbers for each test user.
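The three steps above can be sketched with two in-memory SQLite databases. This is only an illustration under stated assumptions: the `TenantRouter` class, the `users` table layout, and the sample tenant values are hypothetical, SQLite stands in for whatever database is actually used, and the column is spelled `tenant_id` because a hyphenated name would need quoting in SQL.

```python
import sqlite3

# Production keeps the unique, non-null phone_number constraint.
PROD_SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    phone_number TEXT NOT NULL UNIQUE,
    tenant_id TEXT NOT NULL
)"""

# The DoorTest database relaxes it by making phone_number nullable.
TEST_SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    phone_number TEXT UNIQUE,
    tenant_id TEXT NOT NULL
)"""


class TenantRouter:
    """Hypothetical routing layer: picks a database by L0, filters rows by tenant."""

    def __init__(self) -> None:
        self.prod = sqlite3.connect(":memory:")
        self.test = sqlite3.connect(":memory:")
        self.prod.execute(PROD_SCHEMA)
        self.test.execute(TEST_SCHEMA)

    def _db(self, tenant: str) -> sqlite3.Connection:
        # Physical isolation: the DoorTest L0 gets its own database.
        l0 = tenant.split(":", 1)[0]
        return self.test if l0 == "DoorTest" else self.prod

    def add_user(self, tenant: str, name: str, phone_number=None) -> None:
        db = self._db(tenant)
        db.execute(
            "INSERT INTO users (name, phone_number, tenant_id) VALUES (?, ?, ?)",
            (name, phone_number, tenant),
        )
        db.commit()

    def list_users(self, tenant: str) -> list:
        # Logical isolation: every query carries the extra tenant filter.
        rows = self._db(tenant).execute(
            "SELECT name FROM users WHERE tenant_id = ? ORDER BY id", (tenant,)
        )
        return [row[0] for row in rows]


router = TenantRouter()
router.add_user("DoorDash", "real-consumer", "+15550100")
router.add_user("DoorTest:sandbox", "test-consumer-1")  # no phone number needed
router.add_user("DoorTest:sandbox", "test-consumer-2")
```

With this layout, test users without phone numbers can coexist in the DoorTest database, while the same insert against the production schema is rejected by the constraint.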

Once we isolated the test data, we noticed that more guardrails were necessary to enhance production environment safety against e2e testing.

Figure #2: Mobile and web clients have an option to set DoorTest tenancy. The tenant information is attached as an HTTP header to all outgoing DoorDash API calls. The CDN layer transforms the HTTP header into baggage format. The DoorTest tenant context is propagated across all the services through OpenTelemetry. Each individual service applies data isolation based on the tenancy context.

Building guardrails to enhance production environment safety against e2e testing

We leveraged the multi-tenancy model to build guardrails to enhance production environment safety against e2e testing:

Since the new tenant DoorTest is only applicable for internal usage, we needed the new tenant to be accessible only over the VPN. To achieve this, we added a safeguard to our CDN layer to inspect incoming traffic and remove the tenant header if the value is DoorTest and the client IP is not on the VPN.
E2e testing in the production environment shouldn’t impact real users, so we built the following guardrails by using the tenancy context of the incoming requests in the backend services:
- Test consumers cannot place orders with real stores. Similarly, real consumers cannot place orders with test stores.
- Test Dashers (DoorDash’s term for delivery drivers) cannot get orders assigned from real stores. Similarly, real Dashers cannot get orders assigned from test stores.
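Both guardrails can be sketched in a few lines. The header name `x-tenant-id`, the VPN address range, and the helper names below are hypothetical, headers are assumed to arrive with lower-cased names, and a request with no tenant value is treated as real traffic; the actual CDN rule and backend checks are not shown in this post.

```python
import ipaddress
from typing import Dict, Optional

TENANT_HEADER = "x-tenant-id"                         # hypothetical header name
VPN_NETWORKS = [ipaddress.ip_network("10.8.0.0/16")]  # hypothetical VPN range


def is_test(tenant: Optional[str]) -> bool:
    """True when the L0 part of the tenant value is DoorTest."""
    return tenant is not None and tenant.split(":", 1)[0] == "DoorTest"


def strip_untrusted_tenant(headers: Dict[str, str], client_ip: str) -> Dict[str, str]:
    """CDN-style guardrail: drop a DoorTest tenant header that arrives from off-VPN."""
    ip = ipaddress.ip_address(client_ip)
    on_vpn = any(ip in network for network in VPN_NETWORKS)
    if is_test(headers.get(TENANT_HEADER)) and not on_vpn:
        return {k: v for k, v in headers.items() if k != TENANT_HEADER}
    return dict(headers)


def can_interact(actor_tenant: Optional[str], store_tenant: Optional[str]) -> bool:
    """Backend guardrail: test actors pair only with test stores, real with real."""
    return is_test(actor_tenant) == is_test(store_tenant)


external = strip_untrusted_tenant({TENANT_HEADER: "DoorTest:sandbox"}, "203.0.113.7")
internal = strip_untrusted_tenant({TENANT_HEADER: "DoorTest:sandbox"}, "10.8.1.2")
```

The same `can_interact` check covers both the consumer-to-store and Dasher-to-store rules, since each rule only asks whether the two sides are on the same side of the test/real boundary.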

With the above guardrails, we no longer needed the custom workarounds that are hard to maintain, thus improving developer speed. In addition, the multi-tenancy model paved the way for building more such guardrails in a standardized fashion.

Building tooling for developers to speed e2e testing in the production environment

Along with building the new tenant DoorTest with the guardrails, we identified the need for UI tooling that automates the most common tasks to speed up e2e testing. The new tool enables the following functionality in the production environment:

Automatic creation of test users (consumer, dasher, etc.) in the new tenant
Simulating test user’s address or geographical location
Easy access to a list of pre-created test stores and the ability to automatically place orders from them
Ability to create Dasher shifts and automatically assign test orders to them

This internal UI tool is backed by a microservice that exposes the same functionality through a gRPC API so that developers can leverage it in automated e2e tests. Through this tooling, developers can create various scenarios reliably and quickly for their e2e testing purposes.

Conclusion

Lots of companies develop using staging environments, but we found that our staging environment was hard to maintain and operate in a large distributed system. So we developed a solution that moved e2e testing into the production environment using multi-tenancy with guardrails.

By moving e2e testing to the production environment, we observed the following results:

Given the production environment has a high focus on observability and alerting, the reliability of e2e testing is very high in contrast to e2e testing in the staging environment.
The multi-tenancy model provides a standard mechanism for services to do data isolation and customize behavior for e2e testing. This eliminates the need for copying production data into the staging environment and the need for custom workarounds that are otherwise needed for developers to perform e2e testing in production.
As a result of the standardization brought by the multi-tenancy model, it became possible to build developer tooling that automates the most common tasks to speed up e2e testing.
Since this multi-tenancy model provides isolation and safety for e2e testing in the production environment, we developed shift-left tooling through Signadot. This tooling allows developers to e2e test their features during the development phase to gain more confidence over their code changes in addition to unit/integration tests. We will follow up with a detailed blog post explaining our journey with this shift-left tooling.

Acknowledgements

Thanks to Marco Chirico, Adam Rogal, Amit Gud, Jessica Mckenna, Carlos Herrera, Ivar Lazzaro, Maria Sharkina, Ignacio Scopetta and many other individuals who contributed to this effort.
