Using IaC aggressively

We're already using IaC, how is this different?

More often than not, when we or our customers use Terraform, it is used as a tool for provisioning cloud infrastructure, and only cloud infrastructure. 

That is, using it to, e.g.:

  • Create managed networks
  • Create VMs
  • Enable services
  • etc.

See the pattern there? It's really just the cloud infrastructure, which is rarely useful on its own. A typical cloud deployment not only has more advanced infrastructure, such as serverless services or a K8s cluster in place of VMs, but it also has custom application code.

Simplicity vs. practicality

Using simple tools that each do only one thing is a central tenet of several philosophies.

The Unix philosophy:

Do one thing and do it well

The suckless philosophy (though for more reasons than just this one):

... keeping things simple, minimal and usable...

However, the world of public cloud is, for better or for worse, not built around these philosophies, so I believe there should be a tool that bridges them to it. Sometimes things need to happen to keep the whole infrastructure - perhaps including the application code - together, and that is something Terraform has no way of providing.

Using an IaC monorepo

The biggest gripe I personally have with how Terraform - or any other form of IaC - tends to be used is that users shy away from fully adopting it. Specifically, most IaC codebases describe just a single cloud project (with or without the "project" resource itself) and are then applied by hand, which sometimes takes more effort than simply doing the whole deployment manually.

It is entirely possible to manage the whole of one's infrastructure with Terraform, although the details differ per platform.

On GCP, you can manage almost everything through Terraform - everything but the creation of the organization itself - and the result may even be simpler to grasp afterwards.

  • google_organization data source - i.e. you cannot create the org through Terraform, but once you have it created, you can still get metadata from it
  • google_folder resource - now this is a resource
  • project-factory module - this is a module from Google that helps with the creation of projects
  • org-policy module

Contractor access & separation

You might worry that you'll be forced to allow e.g. external contractors into this large codebase, and that they'll be able to see everything. But this is not something that an IaC tool should have to account for - it can, and perhaps even should, be solved at the level of the code platform - generally Git on some hosting site, such as GitHub or GitLab.

A very similar worry is that having everything in one place creates a potential bomb - one bad change could affect everything - and that worry is absolutely reasonable. Then again, having everything in a single codebase lets a linter catch many issues before you even think of committing the code.

  • If contractors seeing the code is not an issue, then it can be solved with access roles (where only specific people can work on the main/master/trunk/etc. branch) and a Git workflow such as Trunk Based Development.
  • But if contractors seeing the code is an issue and the code must be separated, this can be solved with features such as Git's submodule system: let the contractors develop in a separate repository and include that repository in the large one as a Git submodule, so that at deployment time the infrastructure code still appears as one large codebase.

Providers are just like resources

You can create a Terraform provider block the same way you create a standard resource. That is, you can parameterize the contents of the block.

The code examples below show this quite clearly.

This is taken from the K8s provider documentation on terraform.io and shows an EKS setup:

provider "kubernetes" {
host = var.cluster_endpoint
cluster_ca_certificate = base64decode(var.cluster_ca_cert)

exec {
api_version = "client.authentication.k8s.io/v1beta1"
args = ["eks", "get-token", "--cluster-name", var.cluster_name]
command = "aws"
}
}

As you can see, though, this approach pollutes the codebase a bit: you now need to ensure that whatever or whoever runs the codebase has the aws CLI installed. Credentials should not be an issue, though - since Terraform executes the command itself, the command sees the same credentials that you passed to Terraform through environment variables.

On GCP / GKE, this can be solved in a similar fashion:

provider "kubernetes" {
host = "https://${data.google_container_cluster.my_cluster.endpoint}"
token = data.google_client_config.default.access_token
cluster_ca_certificate = base64decode(data.google_container_cluster.my_cluster.master_auth[0].cluster_ca_certificate)
}

Kubernetes resources

Terraform - and probably every other form of IaC - utilizes the fact that Kubernetes resources tend to be written in a common, serializable format: YAML. And since HCL - the language you write Terraform code in - has the same constructs (maps, lists, etc.), any resource can easily be converted to HCL.

The two following files are basically the same thing:

resource "kubernetes_deployment" "nginx" {
metadata {
name = "nginx"
}
spec {
replicas = 2
selector {
match_labels = {
app = "nginx"
}
}
template {
metadata {
labels = {
app = "nginx"
}
}
spec {
container {
image = "nginx"
name = "nginx-container"
port {
container_port = 80
}
}
}
}
}
}

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx-container
        image: nginx
        ports:
        - containerPort: 80

While the two languages do have their differences, note that HCL uses e.g. the port {} block, which may be written multiple times within the same parent block - it is a list structure. Had it been written as ports = {}, it would have been a map.

The bootstrap issue

It does not exactly feel good to be told that you can manage everything via Terraform and then that the bootstrap issue is still there, I know.

For both of the cloud providers that we use - AWS and GCP - you need to have an account or a project, respectively, from which you obtain credentials: an IAM user or an IAM service account, respectively.

You'll also ideally need to create a bucket on S3 or GCS for storing the "state". State really is unavoidable - it's how the IaC tool knows what it is supposed to manage, even after something has been removed from the code. That is: after you deploy something through Terraform and then delete it from the code, Terraform needs to know what it created in the past so that it can delete the real resource too.

On GCP specifically, you'll also need to pre-create a Billing Account, but the management of the billing account afterwards can be done in IaC.

But after that, that's it - that's everything that needs to be pre-created. From here on, everything can be done via Terraform, or Pulumi, as you'll read further down.

Enter Pulumi

In the simplicity vs. practicality section, I mentioned that there should be a tool that bridges the philosophy of tool simplicity with the reality of how the APIs and other interfaces of public cloud platforms actually look.

That tool might be Pulumi. It's conceptually like Terraform, but with some major differences:

  • The infrastructure code is not HCL. It's TypeScript, Go, Python, Java, C#, and even YAML at the time of writing this article.
  • It can do any "extra things" that the programming/scripting languages can
  • The "resource parent" option

But there is one thing it remarkably shares with Terraform: the Terraform providers. That's right - it can use Terraform's provider binaries. That's not always the case, though; a case in point is the Pulumi Google Native provider, which the creators of Pulumi built to bypass some shortcomings of the "correctly made" Terraform provider. Then there's also the Kubernetes Extensions / kubernetesx provider, which applies the same idea to Kubernetes.

Note that Pulumi with YAML cannot do some things that Pulumi with other languages can, like resource lookup.
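
For reference, this is roughly what such a resource lookup looks like in TypeScript - a minimal sketch with hypothetical names, assuming the standard @pulumi/gcp library (its getNetwork function mirrors the Terraform google_compute_network data source):

import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// Look up an existing, manually created network instead of declaring it
const existingVpc = gcp.compute.getNetwork({
    name: "legacy-vpc",              // assumed to already exist
    project: "my-bootstrap-project", // hypothetical project ID
});

// The lookup returns a Promise, so wrap it in an Output to use it as an input elsewhere
export const existingVpcSelfLink = pulumi.output(existingVpc).selfLink;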

Now, some might say that Pulumi is definitely not simple, and they're right; internally, Pulumi is not exactly simple. But it does expose a simple and relatively consistent interface - the code, or more specifically the libraries for all the supported languages. Then again, some might say that you should not need a full-fledged programming language just to deploy infrastructure, and they're right again. But there are times when you need to do something that's not exactly "nice", like making an HTTP request to a resource after it's been created and before anything else is created. Terraform simply cannot do this on its own, but languages like Python and JS can.
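
As a rough sketch of what that could look like in Pulumi with TypeScript (the bucket and the webhook URL are purely hypothetical; this relies on apply() receiving real values only during an actual update, plus the global fetch available in Node 18+):

import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

const bucket = new gcp.storage.Bucket("assets", { location: "EU" });

// Once the bucket exists and its outputs are known, call an external webhook.
// The isDryRun() guard keeps the request from firing during `pulumi preview`.
const notified = bucket.url.apply(async url => {
    if (!pulumi.runtime.isDryRun()) {
        await fetch(`https://example.com/hooks/bucket-created?bucket=${encodeURIComponent(url)}`);
    }
    return url;
});

// Any resource that takes `notified` as an input will wait for the request to finish.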

To showcase some of the most interesting features, I've elected to use TypeScript in the following code example. It could be code for a development/testing/production GKE cluster setup - it creates its own folder and has a project resource templating function:

(note that I ran npm install for all the imported libraries)

import * as pulumi from "@pulumi/pulumi";
import * as random from "@pulumi/random";
import * as gcp from "@pulumi/gcp";
import * as k8s from "@pulumi/kubernetes";

interface Metadata {
    slugShort: string;   // test
    slugLong: string;    // testing
    displayName: string; // Testing
}

interface Settings {
    billingAccountId: string;   // 012345-678901-234567
    parentDir: string;          // 123456789012 (numeric folder ID)
    region: string;             // europe-west1
    zone: string;               // europe-west1-b
}

// Here, we gain access to the input variables, called the stack config
let config = new pulumi.Config();
// From the stack config, we can get our custom data, like metadata and settings here
let metadata = config.requireObject<Metadata>("metadata");
let settings = config.requireObject<Settings>("settings");

// Generate a random ID to use as a suffix for uniquely-named resources on GCP
const deploymentId = new random.RandomString("deploymentId", {
    length: 6,
    special: false,
    lower: true,
    upper: false,
    number: true,
}).result

const folder = new gcp.organizations.Folder(metadata.slugLong,
    {
        displayName: metadata.displayName,
        parent: `folders/${settings.parentDir}`,
    }
);

// Templated project + provider generator.
//
// For the input args, I've just copied over the types from the
// `gcp.organizations.Project` constructor.
const providerProjectTemplate = (
    resourceId: string,
    projectName: string | Promise<string> | pulumi.OutputInstance<string>,
    projectId: string,
    folderId: string | Promise<string> | pulumi.OutputInstance<string>,
    apiServices: Array<string>
) => {
    const project = new gcp.organizations.Project(resourceId, {
        // The pulumi.interpolate templating function awaits the projectId and deploymentId promises
        projectId: pulumi.interpolate`${projectId}-${deploymentId}`,
        name: projectName,
        billingAccount: settings.billingAccountId,
        folderId: folderId,
        autoCreateNetwork: false,
    }, { parent: folder });

    const apis = apiServices.map(apiService => {
        return new gcp.projects.Service(
            apiService,
            {
                // We can't use the project as a provider, so
                // we explicitly define project here, and the
                // provider itself depends on the api list
                project: project.projectId,
                service: apiService,
                disableDependentServices: true,
            }, { parent: project }
        );
    });

    // The provider dependsOn apis so that we can use the project with the
    // enabled APIs as-is, as the APIs are, from an outside POV, considered
    // to be a part of the project
    const provider = new gcp.Provider(resourceId, {
        project: project.projectId,
        region: settings.region,
        zone: settings.zone,
    }, {
        dependsOn: apis,
        parent: project
    });

    return provider;
};

const gcpProvider = providerProjectTemplate(
    "gcp",
    // We can use a normal templating string here since metadata is not a promise
    `rud-${metadata.slugShort}-pulumi`,
    "rud-test-pulumi",
    folder.folderId,
    ["container.googleapis.com", "compute.googleapis.com"]
);

const vpc = new gcp.compute.Network("mainVpc", {
    name: "main",
    autoCreateSubnetworks: false,
}, {
    provider: gcpProvider,
    parent: gcpProvider
})

const subnet = new gcp.compute.Subnetwork("gkeSubnet", {
    network: vpc.id,
    name: "gke",
    ipCidrRange: "10.0.0.0/28",
}, { parent: vpc });

const gkeServiceAccount = new gcp.serviceaccount.Account("gkeServiceAccount", {
    accountId: "gke-cluster",
    displayName: "GKE Cluster",
}, {
    provider: gcpProvider,
    parent: gcpProvider
});

// Here I create the GCP resource which creates a GKE cluster
const gkeCluster = new gcp.container.Cluster("cluster", {
    name: "pulumi",
    removeDefaultNodePool: true,
    network: vpc.selfLink,
    subnetwork: subnet.selfLink,
    initialNodeCount: 1,
}, { parent: subnet });

const defaultNodePool = new gcp.container.NodePool("preempt", {
    cluster: gkeCluster.name,
    nodeCount: 3,
    nodeConfig: {
        preemptible: true,
        machineType: "e2-medium",
        serviceAccount: gkeServiceAccount.email,
        oauthScopes: ["https://www.googleapis.com/auth/cloud-platform"],
    },
}, { parent: gkeCluster });

// And here I create the K8s provider for K8s resources to use
const k8sProvider = new k8s.Provider("k8s", {
    kubeconfig: pulumi.all(
        [gkeCluster.project, gkeCluster.name, gkeCluster.endpoint, gkeCluster.masterAuth]
    ).apply(([project, name, endpoint, masterAuth]) => {
        const context = `${project}_${name}`;

        return `
        apiVersion: v1
        clusters:
        - cluster:
            certificate-authority-data: ${masterAuth.clusterCaCertificate}
            server: https://${endpoint}
          name: ${context}
        contexts:
        - context:
            cluster: ${context}
            user: ${context}
          name: ${context}
        current-context: ${context}
        kind: Config
        preferences: {}
        users:
        - name: ${context}
          user:
            auth-provider:
              config:
                cmd-args: config config-helper --format=json
                cmd-path: gcloud
                expiry-key: '{.credential.token_expiry}'
                token-key: '{.credential.access_token}'
              name: gcp
        `;
    })
}, {
    parent: gkeCluster, dependsOn: [defaultNodePool]
});

const appNamespace = new k8s.core.v1.Namespace("app", {
    metadata: { name: "app" },
}, { provider: k8sProvider, parent: k8sProvider });

Note how a new pulumi:providers:gcp is created after the project: once we create our own project in code, all of the provider's children can inherit the new project ID and the default region and zone. This unfortunately introduces a potential error when somebody forgets to include the parent opt - whatever resource they're creating then gets created in the "stack default" project, if one is set (best case it fails immediately because it isn't in the same project as the resources it references; worst case you just have to wait until it gets recreated). The top-most gcp:organizations:Folder inherits its settings from the stack defaults (note how it doesn't have a parent in the code above), which I have set to the bootstrap project via pulumi config set gcp:project <bootstrap-project-id>.

The vpc - and anything else whose parent doesn't already carry the provider - has to have the provider opt set to the locally created one, because setting a provider as a parent unfortunately doesn't also make it the resource's provider.

The same applies to the Kubernetes provider - it waits for the GKE cluster to be created because it depends on values the cluster provides, and I construct the kubeconfig it needs manually from interpolated values. That provider is then used as both the parent and the provider for the K8s resources, of course.

This is the output of the code as it is so far:

     pulumi:pulumi:Stack                                   pulumi-demo
     ├─ random:index:RandomString                          deploymentId
     └─ gcp:organizations:Folder                           testing
        └─ gcp:organizations:Project                       test
           ├─ gcp:projects:Service                         compute.googleapis.com
           ├─ gcp:projects:Service                         container.googleapis.com
           └─ pulumi:providers:gcp                         gcp
              └─ gcp:compute:Network                       mainVpc
                 └─ gcp:compute:Subnetwork                 gkeSubnet
                    └─ gcp:container:Cluster               cluster
                       └─ pulumi:providers:kubernetes      k8s
                          └─ kubernetes:core/v1:Namespace  app
     ...

Building & deploying containerized apps through Pulumi

Finally, you can build the apps that the cluster is going to use directly within Pulumi.

Prerequisites:

  • An app with a Dockerfile / Containerfile
  • Docker provider: npm install @pulumi/docker
  • Artifact Registry enabled (by adding the API name into the project services array): artifactregistry.googleapis.com
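
For that last prerequisite, the providerProjectTemplate call from earlier simply gains one more entry in its services array - a sketch of the revised call:

const gcpProvider = providerProjectTemplate(
    "gcp",
    `rud-${metadata.slugShort}-pulumi`,
    "rud-test-pulumi",
    folder.folderId,
    // Artifact Registry added alongside the APIs enabled earlier
    ["container.googleapis.com", "compute.googleapis.com", "artifactregistry.googleapis.com"]
);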

First, create an Artifact Registry repository for Docker containers and give GKE access to it:

const containerRepo = new gcp.artifactregistry.Repository("registry", {
    format: "DOCKER",
    repositoryId: "pulumi-demo",
    description: "Pulumi demo",
    location: subnet.region,
}, {
    provider: gcpProvider,
    parent: gcpProvider
});

const containerRepoUrl = pulumi.interpolate`${containerRepo.location}-docker.pkg.dev/${containerRepo.project}/${containerRepo.repositoryId}`;

const gkeToDockerRepo = new gcp.artifactregistry.RepositoryIamBinding("gkeToDockerRepo", {
    repository: containerRepo.name,
    location: containerRepo.location,

    role: "roles/artifactregistry.reader",
    members: [gkeServiceAccount.member]
}, { parent: containerRepo });

GKE itself should also depend on the repo IAM binding, so that a situation where the cluster cannot pull from the repo can never occur - see the sketch below.
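
A minimal sketch of that change, revisiting the cluster declaration from earlier (this assumes gkeToDockerRepo is declared before, or moved above, the cluster):

const gkeCluster = new gcp.container.Cluster("cluster", {
    name: "pulumi",
    removeDefaultNodePool: true,
    network: vpc.selfLink,
    subnetwork: subnet.selfLink,
    initialNodeCount: 1,
}, {
    parent: subnet,
    // explicit dependency: don't create the cluster until it can pull from the repo
    dependsOn: [gkeToDockerRepo],
});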

Note that an Artifact Registry repository (Artifact Registry succeeded Google's Container Registry) does not hold images directly under its own name - it really is just a repository, so the image name comes after it. The repo URL unfortunately has to be constructed manually as of the time of writing, but fortunately the naming is consistent.

Add the Docker library and the image itself with a path to a directory with a Dockerfile and the name of the image:

import * as docker from "@pulumi/docker";

const appImage = new docker.Image("app", {
    build: "/home/user/src/app",
    imageName: pulumi.interpolate`${containerRepoUrl}/sample`,
}, { parent: containerRepo });

Now, when the pulumi up command is run, Pulumi builds the container during the "Previewing update" phase (i.e. even before asking "yes" / "no" for the deployment), and when the deployment update happens, the image is pushed to the registry based on the image name / URL.

The appImage.imageName attribute now contains the full URL + the image hash, which we can use to refer to this specific container image.

new k8s.apps.v1.Deployment("sample", {
    metadata: {
        namespace: appNamespace.metadata.name,
        name: "sample",
    },
    spec: {
        replicas: 3,
        selector: {
            matchLabels: {
                app: "sample"
            },
        },
        template: {
            metadata: {
                labels: {
                    app: "sample"
                }
            },
            spec: {
                containers: [
                    {
                        name: "sample",
                        image: appImage.imageName,
                        ports: [{ containerPort: 80 }]
                    }
                ]
            }
        }
    }
}, { parent: appNamespace });

With this code, Pulumi will wait until the containers are actually up and running, which means that when the update finishes successfully, everything is most likely OK.


Summary

You can use a single Terraform codebase to provision most of your infrastructure, but Pulumi can do the same and also build and deploy the apps that run on that infrastructure. How cool is that?

If you want to focus on the core of your business operations and don't have time to hire and build an in-house cloud-ops team, let us take care of your cloud infrastructure. Revolgy offers a skilled extension of your team when it comes to ongoing maintenance and on-demand scaling of your public cloud infrastructure.

If you are interested in learning more about Cloud Operations or need some advice, don't hesitate to get in touch.