Atlantis Hardening and Review Fatigue

December 5, 2023

Dmitriy Dunin

Dmitriy is a Security Engineer on the Application and Cloud Security teams at DoorDash. His focus is on secure design.

Ron Waisberg

Ron is a Security Engineer at DoorDash. He likes bugs and figuring out how things work.

Infrastructure-as-code and pull request automation

Infrastructure-as-code (IaC) enables a declarative, reusable, and auditable way to manage configuration changes. At DoorDash, the primary platform for this is Terraform, operated by an account-isolated or specifically configured Atlantis instance running in ECS and backed by GitHub.

This type of configuration can be used to manage a myriad of infrastructure, such as Okta, Stripe, Chronosphere, or AWS. For the purposes of this article, we’ll focus on AWS.

A basic workflow for creating an AWS Account could be as simple as creating a new GitHub repository from a template and then issuing a PR against a repository containing the IaC for the account managing the AWS Organization. Atlantis automatically plans on the newly issued PR, and an admin, engineer, or other authorized personnel reviews and approves the proposed changes as appropriate. Upon approval, someone with access to comment on PRs, such as the author or an approver, can leave the comment “atlantis apply,” instructing Atlantis to execute the proposed plan and merge the PR upon success.

Because the Atlantis instance is isolated to the specific AWS Account and only executes the plan post-approval, one would assume that this is a safe setup. However…

Bypassing approval

By default, Atlantis dutifully executes terraform plan in directories where changes to specific files, for example *.hcl, have been made. terraform apply cannot be run unless the PR has been approved. Terraform, however, is a flexible and powerful tool. Terraform providers execute code at plan time and can be pulled from outside the public registry. A user with the ability to open PRs could host, fetch, and execute a malicious provider to circumvent PR approval requirements. In fact, such a user wouldn’t even need to host a malicious provider. An official provider, external, contains a data source which can be used to tell Atlantis to do pretty much anything.

The troubling fact is that the external data source can execute arbitrary code at plan time with the same privileges and in the same environment as Atlantis, allowing arbitrary changes to be made without any need for review or approval.
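To make the risk concrete, here is a minimal and deliberately harmless sketch of the external data source. The data source name `whoami`, the output name, and the shell command are illustrative assumptions; the command only reports the user that Atlantis runs as, but any other command could take its place.

```hcl
# Illustrative only: a harmless command standing in for an arbitrary one.
terraform {
  required_providers {
    external = {
      source = "hashicorp/external"
    }
  }
}

# The program runs during `terraform plan`, before any reviewer has approved the PR.
# It must print a JSON object whose values are all strings.
data "external" "whoami" {
  program = ["sh", "-c", "printf '{\"user\":\"%s\"}' \"$(id -un)\""]
}

# The result surfaces in the plan output that Atlantis posts to the PR.
output "plan_time_user" {
  value = data.external.whoami.result.user
}
```

Nothing here requires `terraform apply`, which is why an approval requirement on apply alone does not contain it.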

Plugging the leak

Atlantis supports powerful server-side customization of the default plan and apply workflows. This is only effective if the server is not configured to let repositories supply their own configuration overrides. With a custom workflow, tools such as Conftest can evaluate Open Policy Agent (OPA) policies that define an allowlist of providers before terraform plan is executed. Given the large number of providers available in the Terraform Registry and the ability to pull providers from arbitrary sources, a strict provider allowlist removes the ability to apply changes or leak environment data at plan time.

To create such an allowlist, it’s important to let Terraform resolve its dependency graph instead of trying to parse required_providers because unapproved providers can be referenced by external modules and their transitive dependencies. Once the dependency graph is resolved with terraform init, all required providers can be found in the dependency lock file alongside version and checksum information. Here is an example server-side config validating an allowlist of providers against the dependency lock file:

repos:
- id: /.*/
  branch: /^main$/
  apply_requirements: [approved, mergeable]
  workflow: opa
workflows:
  opa:
    plan:
      steps:
        - init
        - run: conftest test --update s3::https://s3.amazonaws.com/bucket/opa-rules --namespace terraform.providers .terraform.lock.hcl
        - plan

A starter policy evaluating just the provider source address appears as follows:

package terraform.providers

# Provider source addresses that may appear in the dependency lock file.
allowed_providers = {
    "registry.terraform.io/hashicorp/aws",
    "registry.terraform.io/hashicorp/helm",
    "registry.terraform.io/hashicorp/kubernetes",
    "registry.terraform.io/hashicorp/vault",
}

# Deny any provider in the lock file that is not on the allowlist.
deny[msg] {
    input.provider[name]
    not allowed_providers[name]
    msg = sprintf("Provider `%v` not allowed", [name])
}

With version and checksum information available in the dependency lock file, OPA policies could enforce not just certain providers but also non-vulnerable versions and known checksums.
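For reference, a dependency lock file entry has roughly the following shape. The version, constraint, and hash values below are placeholders rather than real data; the provider source address in the block label is the key that the starter policy checks.

```hcl
# Sketch of a .terraform.lock.hcl entry; every value below is a placeholder.
provider "registry.terraform.io/hashicorp/aws" {
  version     = "5.0.0"    # exact version selected by `terraform init`
  constraints = ">= 4.0.0" # constraint declared in the configuration
  hashes = [
    "h1:<placeholder-checksum>",
    "zh:<placeholder-checksum>",
  ]
}
```

A stricter policy could compare the `version` and `hashes` fields against a vetted list in addition to the source address.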

With these precautions, if a bad actor attempts to use the dangerous data source in their HCL, Atlantis will halt before planning:

FAIL - .terraform.lock.hcl - terraform.providers - Provider `registry.terraform.io/hashicorp/external` not allowed

1 tests, 0 passed, 0 warnings, 1 failure, 0 exceptions

The developer experience can be improved by adding a prescriptive error message and defining a process for expanding the provider allowlist. Additionally, a feature can be added to the custom workflow to allow authorized users or groups in GitHub to permit a dangerous plan anyway with a PR comment.

Note that the above implementation relies on the existence of the dependency lock file (.terraform.lock.hcl), which did not exist prior to Terraform 0.14. We recommend enforcing a minimum version of Terraform to prevent downgrade attacks. If you need to support older versions of Terraform, “terraform version” returns provider information starting in 0.11 with JSON output added in 0.13.

Alternative approaches to implementing provider validation include hosting an internal registry and using a network mirror, or baking providers into your image and using -plugin-dir.
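As a sketch of the mirror approach, the Terraform CLI configuration file on the Atlantis host can restrict where providers are installed from. The mirror URL below is hypothetical, and the file location depends on how the Atlantis image is built.

```hcl
# Terraform CLI configuration (for example ~/.terraformrc); the URL is hypothetical.
provider_installation {
  # Install every provider from the internal mirror.
  network_mirror {
    url = "https://terraform-mirror.example.internal/providers/"
  }
  # No `direct` block is declared, so origin registries are not consulted.
}
```

With this in place, a provider that has not been published to the mirror should fail to install during terraform init, so the mirror's contents effectively become the allowlist.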

Reducing review fatigue

Such a workflow can require quite a few people to get anything done. Consider: An engineer simply wants to update a configuration property, but everything requires a review. This can grind productivity to a halt and make for an unpleasant work day waiting to do something as simple as increasing a memory limit on an EC2 instance.

With Conftest and OPA, specific resources can be allow- or deny-listed, so that some changes proceed without approval while others are explicitly flagged for review.

Additionally, approval for changes to specific properties can be delegated to non-specialized teams in GitHub by adjusting CODEOWNERS and writing the HCL in such a way that it reads the property values from non-Terraform files such as .txt files. For example:

locals {
  # Prefer an explicit list; otherwise read one user per line, skipping blanks and "#" comments.
  users = var.users != null ? var.users : var.read_users_from_file == null ? [] : [
    for user in split("\n", chomp(file(var.read_users_from_file))) : user
    if trimspace(user) != "" && substr(trimspace(user), 0, 1) != "#"
  ]
  set_users = toset(distinct(local.users))
}
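A sketch of how such a module might be wired up follows. The variable declarations are inferred from the locals above; the module path, module name, file name, and team name are all hypothetical.

```hcl
# Inside the module (hypothetical path modules/group/variables.tf):
# declarations inferred from the locals above.
variable "users" {
  type    = list(string)
  default = null
}

variable "read_users_from_file" {
  type    = string
  default = null
}

# In the root configuration: membership lives in a plain-text file, one user per line.
# A CODEOWNERS entry such as "/groups/oncall.txt @example-org/oncall-leads"
# would let that team approve membership changes without reviewing any HCL.
module "oncall_group" {
  source               = "./modules/group"
  read_users_from_file = "${path.root}/groups/oncall.txt"
}
```

Because the text file contains only data, a reviewer from the owning team can approve a change to it without needing to understand the Terraform that consumes it.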

The combination of these two techniques can pre-determine that a number of changes are explicitly safe, significantly reducing the need for review by a team member from security or infrastructure engineering.

Management nightmare

Recall the configuration of Atlantis. For safety, each AWS Account has its own instance of Atlantis so that a misconfigured or compromised instance in one account can’t make changes in another account. Each instance runs in Elastic Container Service (ECS) with separately configured containers. Every change to the workflow configuration currently requires a PR. In large AWS Organizations, this can result in a significant number of PRs, creating a tedious process.

Presently, Atlantis is tedious to manage en masse. Simplifying this process is a priority, but requires planning. Some design changes can be made to help. For example, workflow configuration can come from a service or source control management system. Additionally, we can create limited-purpose cross-account AWS Identity and Access Management (IAM) Roles to permit updating of all Atlantis ECS Service Task Definitions and Services. Doing so, however, requires planning to prevent unknown, unreviewed, or unofficial images from being used in the Task Definitions, as well as monitoring CloudTrail logs to reduce the chance of unauthorized changes.

Conclusion

Any sufficiently powerful tool is unlikely to come without risk, so it’s important to review the functionality of tools and systems in the critical path of a workflow. A misconfigured build environment could lead to remote code execution on a developer or continuous integration (CI) machine. A misconfigured PR automation system could lead to something similar or more unfortunate. Maintaining safe operations calls for addressing critical findings in reviews.

Simple roadblocks may provide security but often lead to fatiguing inefficiencies. Few people will continue to use a secure system that they don’t enjoy or that bogs down the entire process. Being mindful of this provides opportunities to explore ways to reduce inefficiency while maintaining excellent security, increasing developer velocity, and reducing fatigue.

Batten down the hatches, full steam ahead!
