What I Learned at Work this Week: More Terraform

Mike Diaz
5 min read · Jul 16, 2023
Photo by Dapur Melodi: https://www.pexels.com/photo/yellow-and-brown-metal-pay-loader-on-he-dirt-1009926/

Since the last time I wrote about Terraform, I’ve been blessed to have avoided too many challenging problems with the framework. In truth, I had a few teammates who took it upon themselves to become infra experts, which insulated me from having to dig too deep into this part of the codebase. Now I’m on a new team, and I don’t have the same luxury, so it’s time to start brushing up.

The Problem

This week, I was tasked with writing a Python script that would make a request to a client API and return a JSON payload of product data. The client obviously wants their data to be secure, so we had to register as a user, which assigned us an API key as well as a client_id and client_secret. We had to pass all these values into our request just to get an OAuth token, then use that token to get the actual data.
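Here is a minimal sketch of that two-step flow using only the standard library. The URLs, the X-Api-Key header name, and the client_credentials grant type are all assumptions for illustration; the real client API may expect these values somewhere else entirely.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoints: the real client API's URLs are not shown in this post.
TOKEN_URL = "https://api.example.com/oauth/token"
PRODUCTS_URL = "https://api.example.com/v1/products"


def build_token_request(api_key, client_id, client_secret):
    """Build the POST that trades our credentials for an OAuth token."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",  # assumed grant type
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"X-Api-Key": api_key},  # assumed header name
        method="POST",
    )


def build_products_request(api_key, access_token):
    """Build the GET for the product data, authorized with the token."""
    return urllib.request.Request(
        PRODUCTS_URL,
        headers={
            "X-Api-Key": api_key,
            "Authorization": f"Bearer {access_token}",
        },
    )


def fetch_products(api_key, client_id, client_secret):
    """Step 1: get a token. Step 2: use it to fetch the JSON payload."""
    token_request = build_token_request(api_key, client_id, client_secret)
    with urllib.request.urlopen(token_request) as response:
        access_token = json.load(response)["access_token"]
    with urllib.request.urlopen(build_products_request(api_key, access_token)) as response:
        return json.load(response)
```

Keeping the request builders separate from the network calls makes the credential handling easy to check without hitting the API.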

Long story short, I didn’t want our credentials to end up on GitHub along with my Python code, so I saved them as a secret using AWS Secrets Manager. But when I tried to run the script through Airflow, I got an AccessDeniedException saying that my IAM role was not authorized to perform GetParameter on the resource.
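The post doesn't show how the script reads the secret, so this is only a sketch of one common approach. It assumes the secret is stored as a JSON string and read with a boto3 Secrets Manager client's get_secret_value call. The secret name in the usage comment is hypothetical.

```python
import json


def load_credentials(secrets_client, secret_id):
    """Read a JSON secret and return it as a dict.

    secrets_client is expected to be a boto3 Secrets Manager client,
    e.g. boto3.client("secretsmanager", region_name="us-east-1").
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was saved as a JSON string, not as binary data.
    return json.loads(response["SecretString"])


# Usage inside the Airflow task (secret name is hypothetical):
#   import boto3
#   client = boto3.client("secretsmanager", region_name="us-east-1")
#   creds = load_credentials(client, "staging/tactical/client-api")
#   creds["client_id"], creds["client_secret"], creds["api_key"]
```

Passing the client in as an argument means the IAM role the job runs under decides whether the call succeeds, which is exactly where the trouble started.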

Choosing your module

I learned that our Airflow jobs use a default role that doesn’t have many AWS permissions. But my team has our own module, so I switched to that. Reading this code, I expected it would work:

statement {
  sid = "AllowGetSecretValue"
  actions = [
    "secretsmanager:GetSecretValue"
  ]
  resources = [
    "arn:aws:secretsmanager:us-east-1:${account_id}:secret:${terraform.workspace}/tactical/*"
  ]
}

In Terraform, a statement block can be part of an aws_iam_policy_document data source, which generates the policy JSON that determines specifics for our AWS role. The properties we used are:

  • sid: Just an identifier for the statement
  • actions: What we’re allowed to do. This is exactly what we need: permission to call GetSecretValue.
  • resources: We have to specify where the rule applies. We interpolate our account_id and the workspace we’re running (dev, staging, or prod). By using the wildcard at the end, we say “any secret that starts with workspace/tactical/ can be read.”
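To see what that wildcard does and doesn't cover, here is a rough Python approximation of the matching. It uses fnmatchcase, which is close to IAM's * and ? wildcards but not identical (fnmatch also treats square brackets specially). The account ID and secret names are placeholders.

```python
from fnmatch import fnmatchcase

ACCOUNT_ID = "111111111111"  # placeholder account ID


def tactical_pattern(workspace):
    """Mirror the interpolated resource string from the policy statement."""
    return (
        f"arn:aws:secretsmanager:us-east-1:{ACCOUNT_ID}"
        f":secret:{workspace}/tactical/*"
    )


def resource_matches(pattern, arn):
    """Approximate IAM wildcard matching with a case-sensitive glob."""
    return fnmatchcase(arn, pattern)


# Secrets Manager appends a hyphen and six random characters to the
# secret name in its ARN, which the trailing wildcard also absorbs.
staging_secret = (
    f"arn:aws:secretsmanager:us-east-1:{ACCOUNT_ID}"
    ":secret:staging/tactical/client-api-AbCdEf"
)
other_team_secret = (
    f"arn:aws:secretsmanager:us-east-1:{ACCOUNT_ID}"
    ":secret:staging/billing/client-api-AbCdEf"
)

print(resource_matches(tactical_pattern("staging"), staging_secret))
print(resource_matches(tactical_pattern("staging"), other_team_secret))
print(resource_matches(tactical_pattern("prod"), staging_secret))
```

Note that the account ID is part of the pattern too, so a secret with the right name in the wrong account is out of reach.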

My job still failed, but I got a new error: ResourceNotFoundException: Secrets Manager can't find the specified secret. I was running my job in a staging environment and I had added a secret while on the staging role. I triple-checked the name and it matched what I had written in the code, so I started looking more closely at the Terraform code. I found an env directory filled with tfvars files. One file, called staging.tfvars, looked like this:

kms_keys = [
  "arn:aws:kms:us-east-1:{the_STAGING_arn}:key/{some_key}",
  "arn:aws:kms:us-east-1:{the_PROD_arn}:key/{some_key}"
]
dev_role_in_prod_account = true
airflow_role_arn         = "arn:aws:iam::{the_PROD_arn}:role/devops"
variables = {
  env = "prod"
}

We’re storing some custom variables here. Let’s try to guess what they mean:

  • kms_keys: KMS stands for Key Management Service. This list holds the ARNs of two encryption keys, one in the staging account and one in the prod account.
  • dev_role_in_prod_account: This is interesting…maybe my job is reading from the prod account instead of staging, which is where my secret lives.
  • airflow_role_arn: If this is the staging variable, why does it have the prod ARN (Amazon Resource Name)?
  • variables: …and why does it appear to be setting an environment to prod?
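One quick way to check which account each variable points at is to pull the account ID out of every ARN. In a standard ARN the account ID is the fifth colon-separated field. The sketch below uses placeholder account IDs (111111111111 for staging, 222222222222 for prod) and a placeholder key ID standing in for the redacted values above.

```python
def account_id_from_arn(arn):
    """Return the account ID field of an ARN.

    ARNs look like arn:partition:service:region:account-id:resource.
    The region is empty for global services such as IAM.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[4]


def accounts_by_variable(variables):
    """Map each variable name to the account IDs its ARN values reference."""
    result = {}
    for name, value in variables.items():
        arns = value if isinstance(value, list) else [value]
        result[name] = [account_id_from_arn(arn) for arn in arns]
    return result


# Placeholder values shaped like the ARN entries in staging.tfvars.
staging_vars = {
    "kms_keys": [
        "arn:aws:kms:us-east-1:111111111111:key/example-key-id",
        "arn:aws:kms:us-east-1:222222222222:key/example-key-id",
    ],
    "airflow_role_arn": "arn:aws:iam::222222222222:role/devops",
}

print(accounts_by_variable(staging_vars))
```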

If vars associated with staging have prod ARNs, this could start to explain the issue I’m seeing. To further investigate, I went back to the main file and searched for some of these vars. That’s where I found the service role.

Service Role

AWS docs explain that an IAM role is like a user account, but instead of belonging to one person it can be assumed by anyone (or anything) that needs it, so it’s like a generic set of permissions. A service role is a role that a service assumes to act on our behalf. I found a service role that mentioned the curious variable dev_role_in_prod_account:

module "service_role_airflow" {
  source    = "../modules/iam/k8s-service-role"
  service   = "team-airflow"
  namespace = "airflow"
  policy    = data.tactical-api-policy.json

  dev_role_in_prod_account = var.dev_role_in_prod_account
  allowed_role_arn_list    = [data.outputs.worker_role_arn]
  allowed_oidc_arn         = data.outputs.oidc_arn
}

Once again, we don’t know what everything here is supposed to do, but it does point us to a source in another directory. Reading that service module, I started to put the pieces together, thanks largely to this helpful comment:

# on dev, we assume a role in prod, which requires a copy of our OIDC provider for the dev *cluster* exists in the prod *account*
oidc_arn = var.dev_role_in_prod_account ? replace(var.allowed_oidc_arn, "/:\\d+:/", ":{the_PROD_arn}:") : var.allowed_oidc_arn

is_prod_sep_workspace = contains(["dev-dev", "staging-staging"], terraform.workspace) ? true : false

OIDC stands for OpenID Connect, which is an authentication protocol. It looks like what we’re saying here is that when dev_role_in_prod_account is true, we swap the account ID in the OIDC provider ARN (likely dev or staging) for the prod one. If my job ends up assuming a role in the prod account, this would explain why the secret I had added using the staging role wasn’t being found.
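To convince myself of what that replace call does, here is a Python equivalent of the same regex substitution. The account IDs and the OIDC provider ARN are made-up placeholders, and the function name is mine, not something from the module.

```python
import re


def resolve_oidc_arn(allowed_oidc_arn, dev_role_in_prod_account, prod_account_id):
    """Mimic the Terraform conditional around replace().

    When the flag is set, every ":<digits>:" run in the ARN is replaced
    with the prod account ID; otherwise the ARN is returned unchanged.
    """
    if dev_role_in_prod_account:
        return re.sub(r":\d+:", f":{prod_account_id}:", allowed_oidc_arn)
    return allowed_oidc_arn


# Placeholder ARN for an OIDC provider registered in a non-prod account.
staging_oidc_arn = (
    "arn:aws:iam::111111111111:oidc-provider/"
    "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
)

print(resolve_oidc_arn(staging_oidc_arn, True, "222222222222"))
print(resolve_oidc_arn(staging_oidc_arn, False, "222222222222"))
```

With the flag on, only the account ID segment changes; the rest of the ARN, including the cluster's OIDC issuer path, stays the same.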

We also see a new variable, is_prod_sep_workspace, which references workspaces called dev-dev and staging-staging. I had seen these among the tfvars files earlier, and when I checked again, I could see that they had the expected ARN values instead of prod. While I was investigating, I was also asking around to see if anyone on my team had experienced something similar. I learned that our infrastructure had undergone a decoupling process because it had originally been set up completely in the prod account. I think that there are two different vars files for staging and for dev because one references the *new* unique role and the other references the original. But clearly, the module I was using routed us back to prod even when running in staging.

Loose Ends

I can’t read Terraform well enough to understand what I should change to make this work the way I wanted it to. To finish my testing, I put a duplicate of my secret into prod, and it worked, which confirmed my theory. What’s most important is that I’m getting into our Terraform codebase while it’s relatively small. Hopefully, I can follow future changes as they happen and make sure I’m familiar with what permissions we have and where various roles point.
