Using DBT on Cloud Run Jobs: A Secure Terraform Approach

When we started to build a Data Warehouse (DWH) at Hunt & Hackett, the goal was to use it for training Artificial Intelligence (AI) models and empowering comprehensive data analytics. But as any data engineer knows, the path from raw data to AI-ready insights is paved with ETL pipelines and data transformations.

Enter data build tool (dbt) - our tool of choice for implementing SQL transformations in our AI-oriented Data Warehouse. But how do you deploy dbt in a cloud environment? In this blog post, we'll dive into how to secure your secrets, how to implement the principle of least privilege with fine-grained IAM policies, and how to add a new dbt SQL model with a single pull request. This is all done using Terraform - no web browser needed.

To tackle this challenge, the data team devised a data management strategy. Because our precise AI and analytics use cases were not yet clear at the time of development, we settled on the Medallion Architecture to organise our DWH. Initially coined by Databricks, the Medallion Architecture organises data into a raw section ('bronze'), a filtered/cleaned section ('silver'), and a production section ('gold'). We found this approach to data management both flexible and relatively organised, making it a perfect start for our DWH's data organisation.

 


Figure 1. Medallion Architecture as proposed by Databricks

 

To implement this data architecture in our DWH of choice (Google BigQuery), we evaluated several solutions. Our main requirement was that the data could be managed with a mainstream programming language, preferably Python or SQL. We also wanted to store our source code in Git and run the tool either as a container or as a fully-managed solution.

Based on this, we eventually narrowed our search down to two options: Google Dataform and data build tool (dbt). Both enable a user to take a modularised approach to writing SQL for managing data in DWHs as well as traditional RDBMS systems.

Eventually, we settled on dbt. Despite being a managed solution with great integration with BigQuery, Dataform couldn't handle an extensive schema of more than 1,000 columns (which was part of our initial use cases). Dbt could handle large schemas and provided more authentication options. Moreover, dbt's documentation and community support were more mature than Dataform's, which gave us the final reason to choose dbt as the data management tool for our DWH.

The Challenge

Despite the data management benefits of dbt, deploying it in production environments can be challenging. First, how do you deploy dbt? Since dbt is a Python-based command-line tool, should you wrap it in an API-like interface? Or, alternatively, should you set up a VPS to host dbt with predefined SQL jobs?

On the other hand, we already had an existing DWH on Google Cloud Platform (GCP). How could we interface dbt to our DWH in a secure way? And how could we do that both efficiently and relatively cheaply? We had the following requirements for our dbt deployment:

  • The dbt deployment should be on a controllable environment (preferably on Google Cloud Platform);
  • The deployment can be managed as a Terraform module;
  • It can run dbt 'jobs', e.g. specific SQL models;
  • The deployment and accompanying dbt jobs have dedicated role-based access control (RBAC) within our DWH project;
  • The RBAC implementation follows the Principle of Least Privilege (POLP), meaning that a dbt job can only access the data it needs to interact with.

After defining our requirements, we started researching dbt's documentation on deployments. We discovered that there are two ways to do it: dbt's proprietary, managed dbt Cloud solution, or the open-source dbt Core variant. Because dbt Cloud only offers RBAC features and an EMEA deployment region on its Enterprise plan (which has 'custom pricing'), we opted for the open-source variant. With dbt Core, we could configure our RBAC with POLP, deploy it in Europe, and directly control the costs in our environments.

Also, we did a little experiment. We attempted to deploy dbt Core, which operates as a Python application, as an API-based service. We initialised a dbt project, created a Python script to handle API calls, and manually deployed it on GCP's Cloud Run service. However, this setup led to unforeseen issues: the service ran 24/7 which incurred unnecessary costs, and required redeployment for every SQL model change, making it an inefficient solution both technically and financially.

Solving the Challenge

During our experiment, running dbt on Cloud Run offered various benefits. Cloud Run can run containerised front- and backend applications without the need to maintain the underlying hardware infrastructure. To avoid running dbt as a 24/7 API (and therefore incurring unnecessary costs), we opted for the 'Jobs' feature of Cloud Run. Jobs use the same managed infrastructure as a Cloud Run Service, but a Job only runs a defined task and quits when done.

To adhere to our requirements and implement dbt on Cloud Run Jobs, we went back to our literal drawing board. Here we designed a cloud architecture which enables us to develop with dbt locally, and deploy our dbt SQL models as Cloud Run Jobs:

 

Figure 2. Cloud architecture for developing with dbt locally and deploying dbt SQL models as Cloud Run Jobs

 

Step-by-step explanation:

  1. A project folder with two GCP projects is created. One project is for the dbt job CI/CD process, the other is the dedicated data warehouse (DWH). Our DWH of choice is BigQuery.
  2. A GitHub repository with an empty dbt project & Dockerfile is connected to Cloud Build in the CI/CD project.
  3. Cloud Build builds the Dockerfile to a container image and stores it in Artifact Registry.
  4. A Cloud Run Job and an accompanying Cloud Scheduler cron job are created based upon the stored dbt image.
  5. A dbt job runs when the Cloud Run Job gets triggered by the Cloud Scheduler.
    a. The dbt job will retrieve its SQL models from a dedicated dbt SQL models repository.
    b. The dbt job project will first install the SQL models as a dbt package.
    c. The dbt job will select the specific to-be-run SQL model from a container argument passed to the job.

All the elements of the solution are built with Terraform where possible.

RBAC and POLP

To limit the exposure of data and to lower the attack surface of the dbt implementation, we designed a POLP-based RBAC for the dbt jobs:

  1. A YAML configuration defines dbt jobs with the following properties:
    a. The dbt model name;
    b. A cron schedule;
    c. BigQuery dataset roles - per BigQuery dataset, one of the following roles is defined: data_editor, data_viewer, data_owner, or admin.
  2. Terraform loops over the configuration and creates a Cloud Run Job per dbt model. For every job, a unique service account is created.
  3. In the BigQuery Terraform resources, each service account is assigned its dataset roles, which are bound to the specific dataset via a dataset IAM policy (see the sketch below).
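
To illustrate step 3, here is a minimal sketch of how a job's dataset roles could be translated into BigQuery dataset IAM bindings. The role map and the input variables shown are illustrative assumptions; our actual module assembles an authoritative dataset IAM policy, but the additive resource below shows the same mapping more compactly.

locals {
  # Illustrative mapping from the shorthand role names in the YAML configuration
  # to the corresponding BigQuery IAM roles.
  bigquery_role_map = {
    data_viewer = "roles/bigquery.dataViewer"
    data_editor = "roles/bigquery.dataEditor"
    data_owner  = "roles/bigquery.dataOwner"
    admin       = "roles/bigquery.admin"
  }
}

resource "google_bigquery_dataset_iam_member" "dbt_job_dataset_access" {
  # var.dataset_roles is e.g. { raw = "data_viewer", staging = "data_editor" },
  # taken from the job's entry in the YAML configuration.
  for_each   = var.dataset_roles
  project    = var.google_project_id
  dataset_id = each.key
  role       = local.bigquery_role_map[each.value]
  member     = "serviceAccount:${var.dbt_job_service_account_email}"
}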

Implementing dbt on Cloud Run Jobs

Creating the dbt repositories

How to set up a dbt project in a Git repository has been the subject of plenty of debate on your favourite coding forum. As always, the answer is 'it depends', and the developers at dbt Labs have written a very good piece about it. Despite the options presented by dbt, we decided to go for a two-repository approach.

One repository (or 'image' repository) will contain a Dockerfile which containerises the project and in turn installs the other repository as a dbt package. The second repository (or 'model' repository) will function as a regular dbt project, containing all the SQL models and the project configuration.

The reason why we're separating the SQL models from the containerisation is that it allows us to independently manage development cycles and dependencies for both the container and the SQL models. Managing our dbt SQL models in one repository enables focused version control and isolated testing environments for the dbt models themselves.

On the other hand, maintaining an optimised Dockerfile in a separate repository ensures that our image builds and deployment processes can be refined without interfering with SQL model development. This separation enhances overall maintainability, simplifies code updates, and reduces the complexity associated with combined repositories.

In our Terraform configuration, we create both repositories as follows:

resource "github_repository" "github_repo" {
  name                 = var.name
  description          = var.description
  homepage_url         = "https://github.com/${var.managed_from_org}/${var.managed_from_repository}"
  visibility           = var.visibility
  vulnerability_alerts = var.vulnerability_alerts
  auto_init            = var.auto_init

  has_downloads        = var.has_downloads
  has_issues           = var.has_issues
  has_projects         = var.has_projects
  has_wiki             = var.has_wiki

  allow_rebase_merge   = var.allow_rebase_merge
  allow_squash_merge   = var.allow_squash_merge
  allow_merge_commit   = var.allow_merge_commit
  allow_auto_merge     = var.allow_auto_merge
  allow_update_branch  = var.allow_update_branch

  squash_merge_commit_title   = var.squash_merge_commit_title
  squash_merge_commit_message = var.squash_merge_commit_message
  
  delete_branch_on_merge = var.delete_branch_on_merge

  archive_on_destroy = true
  lifecycle {
    ignore_changes = [
      allow_squash_merge,
      allow_update_branch,
      delete_branch_on_merge,
      merge_commit_message,
      merge_commit_title,
      squash_merge_commit_message,
      squash_merge_commit_title,
      allow_auto_merge,
    ]
  }
}

Also, you can add branch protection and configure further user and team permissions. More on the GitHub Terraform provider can be found here.
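
For example, a minimal branch protection rule on the main branch could look like the following sketch (the specific settings are illustrative, not our exact configuration):

resource "github_branch_protection" "main" {
  repository_id = github_repository.github_repo.node_id
  pattern       = "main"

  # Require at least one approving review before changes can be merged.
  required_pull_request_reviews {
    required_approving_review_count = 1
  }

  # Apply the same protection rules to repository administrators.
  enforce_admins = true
}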

Configuring the dbt repositories

With the repositories in place, we can configure them to work with our Terraform design. Let's configure the SQL models repository first.

SQL Models repository setup

The dbt SQL model repository setup is rather simple. We start with initialising a Python Poetry project:

poetry init

During the Poetry setup, we install the following package dependency:

`dbt-bigquery = "^1.7.7"`

The dbt-bigquery dependency also pulls in the dbt-core Python package. With the Python package management in place, we create our dbt project from within the Poetry shell:

dbt init

This will provide us with the dbt file and folder structure. Let's configure the profiles.yml:


h2_data_warehouse_sql_models:
    target: prod
    outputs:
        prod:
            type: bigquery
            method: oauth
            project: "{{ env_var('GCP_PROJECT_ID') }}"
            dataset: staging # Fill in the targeted dataset
            threads: 5
            location: EU
            priority: interactive
            retries: 2

 

The only change we made is the project value, which is read from the GCP_PROJECT_ID environment variable. This variable is set during the runtime of the Cloud Run Job. With it, we're able to deploy our SQL models in any containerised dbt environment, independent of the Google Cloud project where the data lives. More on the workings of the env_var function can be found here.

Another change we make to the 'vanilla' dbt project is the addition of a properties.yml file, created inside the 'models' folder. This file has two goals: documenting our SQL models and describing our DWH sources. This is done as follows:


version: 2

sources:
  - name: "raw_trainings_data"
    project: "{{ env_var('GCP_PROJECT_ID') }}"
    dataset: "raw_data"
    tables:
      - name: "source_1"
        identifier: "source_1"
      - name: "source_2"
        identifier: "source_2"
  - name: "intermediate_data"
    project: "{{ env_var('GCP_PROJECT_ID') }}"
    dataset: "intermediate_data"
    tables:
      - name: "table_1"
        identifier: "table_1"
      - name: "table_2"
        identifier: "table_2"

models:
  - name: raw_trainings_data.source_1
    description: "Unprocessed data from source 1."
    columns:
      - name: primary_id
        description: "A unique identifier for the row."

We add the tables containing the unprocessed data (which will function as source data) and attach additional documentation to the tables and schemas.

Docker image repository setup

Now we get to the interesting part of the dbt configuration. The Docker image project will be deployed as a Cloud Run Job on GCP and will use the SQL model repository as a dbt package. We'll include the SQL model repo in our packages.yml as follows:

packages:
  - git: "https://{{ env_var('GITHUB_USERNAME') }}:{{ env_var('DBT_ENV_SECRET_PAT') }}@github.com/hnhdev/data-warehouse-dbt-models.git"
    revision: "{{ env_var('DBT_REVISION') }}"
  - package: dbt-labs/dbt_utils
    version: 1.1.1

 

We use the dbt packages feature to include our SQL model repository in our Docker image repository, which will run as the Cloud Run Job. We're importing our dbt SQL project as a Git package, as explained in the dbt documentation. The package configuration uses three environment variables:

  • GITHUB_USERNAME - the username of a dedicated 'service account' which provides access to GitHub. This account is a non-human user, used to control access independently of team fluctuations.

  • DBT_ENV_SECRET_PAT - a token used to authenticate with the other GitHub repository. This enforces POLP: only this one account can access the data-warehouse-dbt-models repository. The token is retrieved from our secret manager when the 'job' runs. The dbt project will use the token to authenticate with our SQL model repository and import it as a package.

  • DBT_REVISION - the repository commit hash or tag which represents the version of the package.

A small difference from the SQL models repository is the configuration of the dbt_project.yml file: we do not include anything underneath the models: section, as that is configured in the SQL models repository.

Dockerfile

The Dockerfile represents the 'job', which essentially runs the dbt run --select {insert dbt model name} command on GCP's Cloud Run. The Dockerfile is set up as a multi-stage build: a 'builder' stage and a 'runtime' stage.

FROM python:3.11-slim-buster AS builder
ARG POETRY_VERSION=1.6.1

ENV \
    # Python doesn't create .pyc files.
    PYTHONDONTWRITEBYTECODE=1 \ 
    # Python prints application output/logs to the terminal.
    PYTHONUNBUFFERED=1 \
    # Python prints tracebacks of error to the terminal.
    PYTHONFAULTHANDLER=1 \
    # Turn off pip's cache.
    PIP_NO_CACHE_DIR=1 \
    # Disable pip's version check to speed up the build.
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    # Increase the timeout seconds for slow package downloads.
    PIP_DEFAULT_TIMEOUT=100 \
    # Poetry will not ask config questions.
    POETRY_NO_INTERACTION=1 \
    # Create Poetry virtualenv which mimics the local virtualenv
    # and is a caching strategy.
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    # Specify Poetry's caching directory.
    POETRY_CACHE_DIR=/tmp/poetry_cache

COPY pyproject.toml poetry.lock ./
RUN pip install poetry==${POETRY_VERSION} && \ 
    poetry install && \ 
    rm -rf $POETRY_CACHE_DIR

FROM python:3.11-slim-buster AS runtime

ENV VIRTUAL_ENV=/.venv \
    PATH="/.venv/bin:$PATH" \
    DBT_ENV_SECRET_PAT=""

RUN apt-get -y update && \ 
    apt-get install --no-install-recommends --yes git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    useradd -r dbt

WORKDIR /h2_data_warehouse
RUN chown dbt /h2_data_warehouse

COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}
COPY --chown=dbt h2_data_warehouse vault.py /h2_data_warehouse/

USER dbt

# Exec-form ENTRYPOINT so that the arguments supplied by the Cloud Run Job
# (the dbt model name) are appended and picked up by "$@".
ENTRYPOINT ["/bin/sh", "-c", "dbt deps && dbt run --select \"$@\"", "dbt-job"]
 
The 'builder' stage creates a virtual environment with Poetry and installs all the necessary dependencies. The environment variables seen in the 'builder' stage are used to optimise the Poetry virtual environment (see the accompanying comments). In the same spirit, the rm -rf $POETRY_CACHE_DIR command deletes the Poetry cache, which is no longer needed once the virtual environment has been built.

The 'runtime' stage consists of the job itself. It updates the package index, installs git (needed for importing the dbt project as a package), and runs a dbt model. The image creates a dedicated user called dbt to execute the job. This is a non-root user which only has ownership of the working directory. The creation of a non-root user is in line with our Principle of Least Privilege (POLP) requirement.

At the end of the Dockerfile, the USER instruction changes the user context in which the container process runs. This starts with the ENTRYPOINT, which installs the SQL model project as a dbt package with dbt deps. The image ends with the dbt run command, followed by the --select flag; the specific dbt SQL model to run is supplied as a container argument by the Cloud Run Job.

By importing our SQL dbt project and running it in the ENTRYPOINT instruction of the Dockerfile, we achieve two goals: we can reproduce and run our dbt 'job' anywhere, and we can do so in line with our POLP principles. The last measures to secure our dbt setup are taken in our Terraform configuration.

 

Creating the CI/CD project

To build the Dockerfile and use it as a container on Cloud Run Jobs, we are going to set up a CI/CD project. A dedicated CI/CD GCP project enables us to control the development and maintenance cycles of the Docker image repository without interference from the SQL model development cycles.

In our Terraform code, we create a new folder called data_build_tool_cicd (which will function as a Terraform module) and a new main.tf file. We initialise the following:

locals {
  gcp_project_apis = [
    "cloudbuild.googleapis.com",
    "artifactregistry.googleapis.com",
    "iam.googleapis.com"
  ]
}

resource "google_project" "project" {
  name                = var.name
  project_id          = var.project_id
  org_id              = var.org_id
  folder_id           = var.folder_id
  billing_account     = var.billing_account
  deletion_policy     = var.deletion_policy
  labels              = var.labels
  auto_create_network = var.auto_create_network
}

resource "google_project_service" "activate_apis" {
  for_each                   = toset(local.gcp_project_apis)
  project                    = google_project.project.project_id
  service                    = each.value
  disable_on_destroy         = false
  disable_dependent_services = false
}

resource "google_service_account" "dbt_image_builder" {
  project     = google_project.project.project_id
  account_id  = "dbt-image-builder"
  description = "Service account to build the DBT image within the Cloud Build CI/CD"
}

resource "google_project_iam_policy" "project_policy" {
  policy_data = jsonencode({
    bindings = concat(var.global_project_iam_members, [
      {
        members = ["serviceAccount:${google_service_account.dbt_image_builder.email}"]
        role    = "roles/artifactregistry.writer"
      },
      {
        members = ["serviceAccount:${google_service_account.dbt_image_builder.email}"]
        role    = "roles/cloudbuild.builds.builder"
      },
      {
        members = ["serviceAccount:${google_service_account.dbt_image_builder.email}"]
        role    = "roles/logging.logWriter"
      }
    ])
  })
  project = google_project.project.project_id
}

These Terraform blocks create a dedicated GCP project with the needed APIs activated, as well as an IAM policy which is assigned to a dedicated service account. Due to our POLP requirement, the service account is only granted the permissions it needs: it can push new images to the Artifact Registry, execute Cloud Build builds, and write logs.

Moving on to our image creation process:

resource "google_artifact_registry_repository" "dbt_job_image_repository" {
  project       = google_project.project.project_id
  location      = var.gcp_region
  repository_id = "dbt-job-image-repository"
  description   = "Repository for storing the Data Build Tool (dbt) job images"
  format        = "DOCKER"
}

resource "google_cloudbuild_trigger" "dbt_job_image_builder" {
  name        = "build-and-store-dbt-job-image"
  description = "Building the DBT docker image used in cloud run jobs"
  project     = google_project.project.project_id
  location    = var.gcp_region

  # The repository is manually connected to the GCP project in the GCP Console UI due to an authentication limitation.
  github {
    owner = "hunt_&_hackett"
    name  = "data-warehouse-dbt"

    push {
      branch = "^main$"
    }
  }

  build {
    step {
      name = "gcr.io/cloud-builders/docker"
      args = [
        "build",
        "-t",
        "${var.gcp_region}-docker.pkg.dev/${google_project.project.project_id}/${google_artifact_registry_repository.dbt_job_image_repository.name}/data_build_tool_job:latest",
        "."
      ]
    }

    images = [
      "${var.gcp_region}-docker.pkg.dev/${google_project.project.project_id}/${google_artifact_registry_repository.dbt_job_image_repository.name}/data_build_tool_job:latest"
    ]
  }

  service_account = google_service_account.dbt_image_builder.id

  trigger_template {
    branch_name = "main"
  }
}

 

We create an Artifact Registry repository to store our dbt job image, and a Cloud Build trigger resource which builds a new image whenever a new version is pushed to the main branch.

Creating the dbt job module

With the process in place to build the dbt image, we can use it in Cloud Run Jobs. The jobs should also work with our existing data warehouse Terraform configuration - therefore we created a new TF module called dbt_cloud_run_job. The module starts by declaring the necessary local variables:

locals {
  cloud_run_job_environment_variables = {
    "GITHUB_USERNAME" = "data-build-tool"
    "GCP_PROJECT_ID"  = var.google_project.id
    "DBT_REVISION"    = "0.2.0"
  }
}

These locals define the environment variables that will be set on the Cloud Run Job; they are the same variables we referenced in the dbt repository sections. Moving on:

resource "google_service_account" "cloud_run_job_runtime_service_account" {
  project     = var.google_project.id
  account_id  = replace("dbt-job-${var.cloud_run_job_name}", "_", "-")
  description = "Service account used as the runtime identity of the Cloud Run Job '${var.cloud_run_job_name}'"
}

resource "google_service_account" "cloud_run_job_invoker_service_account" {
  project     = var.google_project.id
  account_id  = replace("dbt-invoke-${var.cloud_run_job_name}", "_", "-")
  description = "Service account to be used by Cloud Scheduler to invoke the Cloud Run Job '${var.cloud_run_job_name}'"
}

data "google_iam_policy" "cloud_scheduler_job_invoker" {
  binding {
    role    = "roles/run.invoker"
    members = [
      "serviceAccount:${google_service_account.cloud_run_job_invoker_service_account.email}"
    ]
  }
}

The first service account is the runtime identity of the Cloud Run Job itself; the second is the dedicated service account for our Cloud Scheduler. In accordance with our POLP, we create an IAM policy so that it can only invoke this one dedicated Cloud Run Job. With this in place, we create the Cloud Run Job:

resource "google_cloud_run_v2_job" "cloud_run_job" {
  name     = replace(var.cloud_run_job_name, "_", "-")
  project  = var.google_project.id
  location = var.gcp_region

  template {
    template {
      service_account = google_service_account.cloud_run_job_runtime_service_account.email

      containers {
        image = "${var.cloud_run_job_image_url}:latest"
        args  = [var.dbt_model_name]

        dynamic "env" {
          for_each = local.cloud_run_job_environment_variables

          content {
            name  = env.key
            value = env.value
          }
        }
      }
    }
  }

  lifecycle {
    ignore_changes = [
      launch_stage,
    ]
  }
}

The dbt Docker image in the Artifact Registry is provided as a Terraform output of our CI/CD Terraform module and as an input variable for the dbt Cloud Run Job module. Furthermore, the environment variables are dynamically configured in the resource.
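
As an illustration, the wiring between the two modules could look like the following sketch; the output name and the exact image path are assumptions based on the resources shown earlier, not necessarily our exact code.

# In the data_build_tool_cicd module: expose the image URL that Cloud Build pushes.
output "dbt_docker_image_url" {
  description = "Artifact Registry URL of the dbt job image (without the tag)"
  value       = "${var.gcp_region}-docker.pkg.dev/${google_project.project.project_id}/${google_artifact_registry_repository.dbt_job_image_repository.name}/data_build_tool_job"
}

# In the dbt_cloud_run_job module: accept the URL as an input variable.
variable "cloud_run_job_image_url" {
  description = "Artifact Registry URL of the dbt job image"
  type        = string
}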

The launch_stage of the Job is ignored so the Terraform configuration will not update/revert unnecessarily when the job is in production, making the TF management more predictable and stable. With the Cloud Run Job in place, we can create a specific Cloud Scheduler job for it:

resource "google_cloud_run_v2_job_iam_policy" "cloud_run_job_policy" {
  project     = google_cloud_run_v2_job.cloud_run_job.project
  location    = google_cloud_run_v2_job.cloud_run_job.location
  name        = google_cloud_run_v2_job.cloud_run_job.name
  policy_data = data.google_iam_policy.cloud_scheduler_job_invoker.policy_data
}

locals {
  cloud_run_job_uri = "https://${google_cloud_run_v2_job.cloud_run_job.location}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${var.google_project.number}/jobs/${google_cloud_run_v2_job.cloud_run_job.name}:run"
}

resource "google_cloud_scheduler_job" "cloud_scheduler_job" {
  name        = var.cloud_run_job_name
  project     = var.google_project.id
  schedule    = var.schedule
  region      = coalesce(var.scheduler_region, var.gcp_region)
  description = "trigger ${google_cloud_run_v2_job.cloud_run_job.name}"
  paused      = var.scheduler_is_paused
  time_zone   = var.scheduler_time_zone

  retry_config {
    retry_count = var.scheduler_number_of_retries
  }

  http_target {
    http_method = "POST"
    uri         = local.cloud_run_job_uri

    oauth_token {
      service_account_email = google_service_account.cloud_run_job_invoker_service_account.email
    }
  }

  depends_on = [
    google_cloud_run_v2_job_iam_policy.cloud_run_job_policy, # makes sure the service account can actually invoke the cloud run job
  ]
}

The Cloud Scheduler-specific service account is assigned the custom IAM policy and is used in the scheduler's OAuth token configuration. The Cloud Scheduler itself is configured from the input variables provided when declaring the TF module.
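
For completeness, the scheduler-related inputs of the dbt_cloud_run_job module could be declared roughly as follows; the types and defaults are assumptions inferred from how the variables are used above.

variable "schedule" {
  description = "Cron schedule on which the dbt Cloud Run Job is triggered"
  type        = string
}

variable "scheduler_region" {
  description = "Optional region for the Cloud Scheduler job; falls back to var.gcp_region"
  type        = string
  default     = null
}

variable "scheduler_is_paused" {
  description = "Whether the Cloud Scheduler job is created in a paused state"
  type        = bool
  default     = false
}

variable "scheduler_time_zone" {
  description = "Time zone used to interpret the cron schedule"
  type        = string
  default     = "Etc/UTC"
}

variable "scheduler_number_of_retries" {
  description = "Number of retry attempts for a failed job invocation"
  type        = number
  default     = 1
}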

Tying it all together

Now that the dbt repositories, CI/CD module, and Cloud Run Job module are in place, we can connect them to our DWH Terraform configuration. Let's review our current Terraform setup:

terraform_project/
├── modules/
│   ├── dbt_cloud_run_job/
│   ├── data_build_tool_cicd/
│   ├── data_warehouse_gcp_project/
│   └── bigquery_dataset/
├── main.tf
├── variables.tf
└── etc.

Both the data_warehouse_gcp_project/ and bigquery_dataset/ modules will handle the creation of the DWH GCP project, its IAM policy, and the management of the BigQuery datasets. To include our dbt Jobs in this setup, we configure the cloud run job(s) in the data_warehouse_gcp_project/ module:

module "cloud_run_job" {
  source = "../dbt_cloud_run_job"
  for_each = var.data_build_tool_jobs
  cloud_run_job_name = each.key
  dbt_model_name = each.value.dbt_model_name
  google_project = {
    id = module.google_project.gcp_project_id
    number = module.google_project.gcp_project_number
  }
  gcp_region = var.gcp_region
  cloud_run_job_image_url = var.dbt_docker_image_url
  schedule = each.value.schedule
  scheduler_region = "europe-west1"
  vault_mount = var.vault_mount
  vault_parameters = var.vault_parameters
  vault_secret_key = "personal_access_token"
}

This module declaration comes after the declaration of the CI/CD project, which provides the var.dbt_docker_image_url variable. The Cloud Run Jobs are created dynamically from the data_build_tool_jobs variable, which is based upon the data_build_tool_jobs.yaml file:

jobs:
  job_one:
    dbt_model_name: "raw_to_staging"
    schedule: "0 8 * * *"
    dataset_roles:
      raw: "data_viewer"
      staging: "data_editor"
  job_two:
    dbt_model_name: "staging_enhancement"
    schedule: "10 8 * * *"
    dataset_roles:
      staging: "data_editor"

With the YAML file, we can maintain all the needed dbt model jobs for our BigQuery DWH in one central place. The YAML file also provides the POLP-based IAM roles needed for each BigQuery dataset the job interacts with, as explained in the 'Solving the Challenge' section. This granularity is used to manage the exact access roles the service accounts of the dbt jobs need on the exact BigQuery datasets.
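
As a sketch of how this YAML file might be loaded into the data_build_tool_jobs variable (the file path and exact wiring are assumptions; the module and variable names follow the snippets above):

# Load the job definitions from the YAML file and pass them to the DWH module.
locals {
  data_build_tool_jobs = yamldecode(file("${path.module}/data_build_tool_jobs.yaml")).jobs
}

module "data_warehouse_gcp_project" {
  source               = "./modules/data_warehouse_gcp_project"
  data_build_tool_jobs = local.data_build_tool_jobs
  # ... remaining inputs such as gcp_region and dbt_docker_image_url ...
}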

The YAML file concludes our Terraform configuration. The declaration of the dbt jobs ties together all the elements of the proposed design. In production, this setup enables us to go live with new SQL models relatively quickly by opening two pull requests: one on the SQL model repository and one on the YAML file inside the Terraform repository.

Conclusion

Annndd we're done! In conclusion, we successfully implemented a modular dbt setup on Google Cloud, optimizing our SQL scripts for performance, cost-efficiency, and security. By utilizing Cloud Run Jobs instead of a continuously running API server, we significantly reduced operational costs while ensuring that resources are only consumed when needed. The strict enforcement of the principle of least privilege in multiple parts of our infrastructure enhances security, and reduces possible attack vectors of threat actors.

The use of Terraform modules and dbt packages adds flexibility, allowing us to easily update components without disrupting downstream data tables or causing breaking changes. This modularity ensures that our data infrastructure is adaptable to potential new tools and evolving requirements.

Looking ahead, this setup positions us well to scale our datasets with dbt, automate SQL model deployments, and save costs and time through improved efficiency. Overall, we have achieved the following:

  • A Terraform-based method to deploy data build tool in our existing data warehouse configuration;

  • Strict separation of concerns: the CI/CD process, the dbt development cycle, and the DWH project are all managed independently;

  • All the GCP services and code enforce POLP where necessary;

  • An easy process to maintain and extend with new dbt SQL models.
