As DoorDash continues its rapid growth, product development must keep pace, moving new features into production quickly and reliably. Shipping features without previewing them in the production Kubernetes environment is a risky proposition and can slow down product development, because any bug or defect sends everything back to the drawing board. In a Kubernetes environment, developers must build, push, and deploy Docker images in the cluster to preview their features before they are pushed to production. This previewing process was too slow for DoorDash’s needs, so the development productivity team had to find a way to build a faster feedback loop.

As an initial workaround, teams built ad-hoc scripts that leveraged Kubernetes port-forward to access service dependencies running inside the cluster from developer workstations. These ad-hoc scripts created a fast feedback loop for developers to run their microservice locally with actual service dependencies. 

A fast feedback loop — as shown in Figure 1 — improves the process of verifying code changes with real service dependencies, increasing developer velocity and reducing the feature launch time because issues are found much earlier in the development lifecycle.

Figure 1:  In a Kubernetes environment, verifying changes with real services requires undertaking numerous steps that create a slow feedback loop. By contrast, using Kubernetes port-forward to access cluster applications enables change verifications with real services earlier in the development lifecycle, creating a fast feedback loop.

While we initially relied on Kubernetes port-forward to create a fast feedback loop, we quickly hit limitations that made this solution unreliable, hard to maintain, and no longer compatible when DoorDash moved to a multi-cluster architecture.

To address the limitations of the existing fast-feedback-loop solution, we used Signadot and multi-tenancy to build a new solution. In addition to offering new features, the new solution is safe, highly reliable, easy to maintain, and compatible with a multi-cluster architecture. These advantages improved developer velocity, software reliability, and production safety.

Standard Kubernetes port-forward is unsafe and unreliable for product development

Kubernetes port-forward is a mechanism that provides access to an application running in a cluster from a developer workstation. Once the port-forward command is run, requests sent to a local port are automatically forwarded to a port inside a Kubernetes pod. Using port-forward, a service running locally can connect to its upstream dependencies. This way, developers can execute requests against the locally running service and get fast feedback without building, pushing, or deploying Docker images in the Kubernetes cluster.
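
For example, a developer might forward a local port to an upstream dependency and point the locally running service at localhost. The namespace, service name, and ports below are hypothetical:

```shell
# Forward local port 8081 to port 8080 of the payment-service pods.
kubectl port-forward -n payments svc/payment-service 8081:8080

# The locally running service can now reach this dependency at
# http://localhost:8081 instead of the in-cluster address.
```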

Kubernetes port-forward is neither safe nor reliable for product development, for the following reasons:

  • Port-forward requires maintaining API keys on developer workstations. At DoorDash, clients must pass an API key in the protocol headers to authenticate against services. This creates a big hurdle in connecting to upstream dependencies because developers don’t have those API keys on their workstations, and it is cumbersome for each developer to provision and maintain them.
  • Port-forward makes it difficult to restrict API endpoint access. Because port-forward RBAC gives access at the namespace level, it is not possible to build production guardrails such as restricting certain API endpoints.
  • Port-forward requires building and maintaining ad-hoc scripts for each service. Because each service has different upstream dependencies, port-forward requires picking a local port for each one, and each developer must ensure that the port numbers are unused and do not conflict with each other. These extra steps make the port-forward scripts hard to build and maintain, as the sketch after this list illustrates.
  • Kubernetes pod failures or restarts make port-forward unreliable. There is no auto-reconnect feature to cover failure scenarios where the connection is killed or closed by the upstream service. Port-forward’s unreliability is compounded because it works at the pod level, and Kubernetes pods are not persistent.
  • Port-forward doesn’t support advanced testing strategies, including routing test traffic to workstations. Given the one-way nature of port-forward, it only supports sending requests to the upstream services. But developers want to send test traffic through the front-end and receive requests on their workstation that were intended for a specific service. 
  • Port-forward is not compatible with a multi-cluster architecture. DoorDash’s move to a multi-cluster architecture highlighted this incompatibility: services don’t necessarily stay in one cluster, so service discovery would have to be built on top of port-forward.
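
To make the maintenance burden concrete, a typical ad-hoc script looked something like the sketch below (service names, namespaces, and ports are illustrative): every upstream dependency needs its own hand-picked local port and its own long-running port-forward process.

```shell
#!/bin/bash
# Illustrative ad-hoc port-forward script: one forward per upstream
# dependency, each pinned to a hand-picked local port that must not
# conflict with any other forward or local process.
kubectl port-forward -n identity svc/identity-service 8081:8080 &
kubectl port-forward -n payments svc/payment-service  8082:8080 &
kubectl port-forward -n orders   svc/order-service    8083:8080 &
wait  # forwards die silently when pods restart; rerun the script to recover
```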

For these reasons, we quickly hit limitations that made the port-forward-based solution unreliable and hard to maintain. While other well-known solutions such as kubefwd are similar to port-forward and resolve some of these issues, they don’t address the security features we need to enable safe product development in a production environment. 

Signadot and multi-tenancy combine for a fast feedback loop

Signadot is a Kubernetes-based platform that scales testing for engineering teams building microservice-based applications. The Signadot platform enables the creation of ephemeral environments that support isolation using request-based routing. Signadot’s connect feature builds on the platform to enable two-way connectivity between developer workstations and remote Kubernetes clusters.

We leveraged Signadot’s connect feature to connect locally running services to those running in the Kubernetes cluster without changing any configuration. Signadot also supports receiving incoming requests from the cluster to the locally running services.
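
In practice, getting connected is a single CLI invocation. The sketch below assumes the command name from Signadot’s public CLI documentation; exact commands and flags may differ by version, and the service name and port are hypothetical:

```shell
# Establish the tunnel from the workstation into the cluster.
signadot local connect

# In-cluster DNS names now resolve from the workstation, so a locally
# running service can call its upstream dependencies with unchanged
# configuration:
curl http://payment-service.payments.svc.cluster.local:8080/health
```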

How the connect feature works 

Signadot provides a CLI component running on the developer workstation and a proxy component running on the Kubernetes cluster. The connect feature works as follows, which is also illustrated in Figure 2: 

  • Using the CLI component, a developer creates a TCP tunnel and connects to the proxy component running within the Kubernetes cluster.
  • The proxy component sends back all the Kubernetes services identified by the cluster DNS to the developer’s workstation.
  • The CLI component automatically resolves DNS and proxies network traffic intended for the Kubernetes cluster via the TCP tunnel.
  • The TCP tunnel also allows locally running services to receive requests from the Kubernetes cluster.
Figure 2:  Using the Signadot CLI and the proxy components provided by Signadot, developer workstations can reach Kubernetes services without the need for complex port-forwarding scripts. Signadot CLI intercepts the requests intended for the Kubernetes services and routes them through the proxy using the bidirectional TCP tunnel.

Signadot improves local development significantly when compared with Kubernetes port-forwarding for several reasons, outlined here and in Figure 3:

  • Signadot’s connect feature works without the need for complex port-forward scripts. Developers don’t need to write and maintain the custom port-forward scripts for each service, reducing friction and easing maintenance. Running the Signadot daemon on a developer workstation through the CLI is sufficient to reach services within the cluster. 
  • Signadot’s connect feature is resilient to Kubernetes pod failures or restarts. The Signadot proxy automatically refreshes the service endpoint information and pushes it to the developer workstation, so pod failures or restarts won’t impact product development.
  • Signadot provides pluggable functionality to add custom guardrails at the request level. Signadot supports writing Envoy filters for the proxy component; these filters inspect requests and apply custom rules based on the request data.
  • Signadot supports advanced testing strategies when developers want to route test traffic to their workstations. Using the bidirectional tunnel with the cluster, Signadot provides support to route test traffic to developer workstations through integration with Istio or custom SDKs.
  • Signadot provides functionality to integrate with multi-cluster environments.
Figure 3:  Using the Signadot CLI, routes server, and the proxy component provided by Signadot, we supported advanced testing strategies where developers route test traffic to their workstations. Signadot provides support to route test traffic to developer workstations through integration with Istio or custom SDKs.

In the example shown in Figure 3 above, the developer starts service B’ on their workstation and registers a workspace with the routes server using the CLI. The developer then browses the frontend application with the workspace information injected. The requests hit service B (or its sidecar) and then get routed back to service B’ through the Signadot proxy and the bidirectional tunnel.
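
As a rough sketch, the routing decision is driven by a routing key that identifies the developer’s workspace; the sidecar (or a custom SDK) matches the key and diverts the request to the workstation. The header and key values below are illustrative assumptions, following Signadot’s documented pattern of propagating a routing key via OpenTelemetry baggage:

```shell
# Illustrative only: requests carrying this routing key are intercepted at
# service B's sidecar and routed back to service B' on the workstation.
curl https://frontend.example.com/cart \
  -H "baggage: sd-routing-key=alice-workspace"
```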

The need for multi-tenancy to build production guardrails

We need safeguards that prevent developers from intentionally or accidentally accessing sensitive production data when using the fast feedback loop. To achieve this, we adopted a multi-tenancy model and added safeguards in the Signadot proxy so developers can only access test data.

Multi-tenancy is an architecture paradigm in which one software application and its supporting infrastructure are designed to serve multiple customer segments, also called tenants. We addressed the safeguards challenge in the fast feedback loop by adopting the multi-tenancy model and introducing a new tenant named DoorTest in the production environment. The multi-tenancy model enabled us to easily isolate test data and customize software behavior for the DoorTest tenant as explained in our earlier blog post.

In this multi-tenant model, all services identify the tenancy of the incoming requests using a common header called tenant ID. This header is propagated across all our services through OpenTelemetry, enabling our services to isolate users and data across tenants.
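
For illustration, a request in the test tenancy might look like the following; the header name, hostname, and tenant value are assumptions, not DoorDash’s actual internal values:

```shell
# Hypothetical header name and tenant value.
curl https://order-service.internal.example.com/v1/orders \
  -H "tenant-id: doortest"
# OpenTelemetry context propagation forwards this header on every downstream
# call, so each service in the chain sees the same tenancy.
```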

Now that we have a test tenancy created using the multi-tenancy model, we use Signadot Envoy filters to build custom rules that ensure requests only operate in the test tenancy by inspecting the tenant ID header.

Leveraging Signadot envoy filters to build production guardrails

Signadot supports writing Envoy filters for the proxy component. Envoy filters inspect requests going through the proxy and can optionally change the request or response data. Using Envoy filters, we built a custom rule that ensures requests only operate in the test tenancy by inspecting the tenant ID header.

Using the same Envoy filter mechanism, we built a custom rule that automatically injects upstream service API keys into outgoing requests. This way, developers don’t need to configure or maintain API keys on their workstations, as shown in Figure 4.
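
A minimal sketch of such a guardrail is below, written as an Envoy HTTP Lua filter fragment. The header names, tenant value, and key injection are illustrative assumptions, not DoorDash’s actual configuration:

```yaml
# Sketch of a guardrail filter (Envoy HTTP Lua filter fragment).
http_filters:
  - name: envoy.filters.http.lua
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
      inline_code: |
        function envoy_on_request(request_handle)
          -- Guardrail: reject any request that is not in the test tenancy.
          local tenant = request_handle:headers():get("tenant-id")
          if tenant ~= "doortest" then
            request_handle:respond(
              {[":status"] = "403"},
              "only the test tenancy is allowed from developer workstations")
            return
          end
          -- Convenience: inject the upstream API key at the proxy so
          -- developers never store keys locally ("x-api-key" is hypothetical).
          request_handle:headers():replace("x-api-key", "<key injected by proxy>")
        end
```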

Figure 4:  Multi-tenancy model isolates test data from production data. The tenancy context is propagated across all the services through OpenTelemetry, so requests in the test tenancy can only access test data. Custom envoy filters configured in the Signadot proxy restrict the requests only to the test tenancy.

This feature provided us with much-needed flexibility to build new guardrails for the fast feedback loop. 

Supporting multi-cluster architecture   

After establishing the fast feedback loop with a single Kubernetes cluster, we had to upgrade our Signadot proxy server to support our multi-cluster architecture. Previously, all services were running under one Kubernetes cluster. We adopted a multi-cluster architecture to improve system reliability, which means that services are running across multiple Kubernetes clusters. 

In this model, the developer sending a request only needs to know the service’s global URL, not which cluster the service runs in. When a developer requests a service’s global URL, the Signadot CLI and the proxy server forward the request to the service’s internal router. The internal router then dispatches the request to the cluster in which the service is running.

With the port-forwarding-based approach and the initial Signadot solution, we were limited to routing traffic inside a single cluster and could not reach services running in newly introduced clusters. As more services moved to the multi-cluster model, we upgraded Signadot so that developers could continue to test all of their scenarios.

Figure 5 shows the upgraded Signadot architecture and how each request from the local machine is propagated from one cluster to another.

Figure 5: This chart shows how our system routes a developer's request across multiple clusters. In this scenario, we assume that a developer connects to Cluster 1 and works on a service that needs to connect to Service A and Service B, where Service A runs in Cluster 1 and Service B runs in Cluster 2.

The red arrows in Figure 5 show the request flow within Cluster 1, which already worked in the prior version of Signadot.

The caramel-colored arrows in Figure 5 show how our network infrastructure synchronizes the necessary information in the background. To resolve DNS for services running in different clusters, we use the xDS-server and Consul. Consul monitors all the clusters and gathers the list of IPs for each service. The xDS-server must live in Cluster 1 to provide an API endpoint for the Signadot proxy server. For services in Cluster 1, the IP list resolves to the actual pod IPs deployed in Cluster 1; the IP list for services in other clusters resolves to an internal router.

When the request comes to the internal router, it routes the traffic to the service router that will then forward the request to the target cluster. 

The Signadot proxy server periodically pulls the IP lists for all services from the xDS-server endpoint. Once a developer runs the Signadot CLI on their local machine, the CLI pulls the IP lists from the Signadot proxy server and stores one IP for each service in the local /etc/hosts file.
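
The entries written to /etc/hosts might look roughly like this (IPs and hostnames are illustrative):

```shell
# Illustrative /etc/hosts entries maintained by the Signadot CLI.
# Service A runs in Cluster 1, so its name maps to an actual pod IP;
# Service B runs in Cluster 2, so its name maps to the internal router.
10.0.12.34   service-a.internal
10.0.99.10   service-b.internal
```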

Now developers can send requests to any service across multiple clusters from their local workstations. This design also eliminates the concern that local requests could overwhelm the internal DNS server, because lookups are answered locally.

Blue arrows in the diagram show how our system propagates a request from the local workstation to Cluster 2:

  1. A developer's local workstation connects to Cluster 1 and runs the Signadot CLI. The developer sends a request to service B's global URL. At this point, the local workstation knows the IP of the internal router and makes the request to service B with one of those IPs. The request is sent to the Signadot proxy server in Cluster 1 through the TCP tunnel created by the Signadot CLI.
  2. Then the Signadot proxy server routes the request to an internal router. The router is implemented using an Envoy proxy.
  3. The internal router then routes the traffic to a service router. 
  4. The service router ultimately forwards the request to service B running inside Cluster 2, successfully completing a developer's request to service B.

In this example, a developer is sending a request through Cluster 1. Note that developers can choose which cluster to connect to and therefore can send requests through any cluster. Now developers can test their local code by connecting to dependencies in any cluster.

Alternative solutions we considered

We considered Telepresence, which provides a fast feedback loop similar to the one discussed in this post. While it is a great solution, we decided to use Signadot because: 

  • Telepresence requires the use of its sidecar for intercepting requests and routing them to developer workstations. This sidecar conflicts with our compute infrastructure, where we built a custom service mesh solution using Envoy sidecars. So we decided to leverage Signadot SDKs to build the intercepting mechanism in our custom service mesh solution.
  • We needed guardrails applied to the requests originating from the developer workstations, but Telepresence lacks that capability.
  • We integrated Signadot with an internal xDS-server to support multi-cluster architecture, another capability that Telepresence lacks. 

One other alternative would be to move development to remote workstations. While this is a viable idea, we decided to pursue the solution we describe here because: 

  • Building remote development is costly because it requires extra hardware.
  • Building remote development is time-consuming because of the need for multiple integrations with our environment.

Even with remote workstations, we still need the same production guardrails through tenancy and support for multi-cluster architecture as discussed earlier. Using Signadot has helped us reap benefits quickly while providing the flexibility to reuse the same building blocks for future remote development solutions.           

Conclusion 

Many development teams are slowed down by unreliable or slow feedback loops. The Signadot and multi-tenancy solution we describe here is suitable for companies that are growing very rapidly and need to develop product features quickly with high reliability.

Our new fast feedback loop has increased developer velocity by removing the need to repeatedly build and deploy code to the Kubernetes cluster; increased reliability, because developers can easily test their code during the development phase; and boosted production safety through standardization and guardrails.