When we started to build a Data Warehouse (DWH) at Hunt & Hackett, the goal was to use it for training Artificial Intelligence (AI) models and empowering comprehensive data analytics. But as any data engineer knows, the path from raw data to AI-ready insights is paved with ETL pipelines and data transformations.
Enter data build tool (dbt) - our SQL tooling of choice for our AI-based Data Warehouses. But how do you deploy dbt in a cloud environment? In this blog post, we'll dive into how to secure your secrets, how to implement the principle of least privilege with fine-grained IAM policies, and how to add a new dbt SQL model with a single pull request. This is all done using Terraform - no web browser needed.
To tackle this challenge, the data team devised a data management strategy. Because our precise AI and analytics use cases were not yet clear at the time of development, we settled on the Medallion Architecture to organise our DWH. Originally coined by Databricks, the Medallion Architecture organises data into a raw section ('bronze'), a filtered/cleaned section ('silver'), and a production section ('gold'). We found this approach to data management both flexible and relatively organised, making it a perfect start for our DWH's data organisation.
To implement this data architecture in our DWH of choice (Google BigQuery), we evaluated several solutions. Our main requirement was that the data be manageable with a mainstream programming language, preferably Python or SQL. We also wanted to store our source code in Git and either run the tool as a container or use a fully managed solution.
Based on this, we eventually narrowed our search down to two options: Google Dataform and data build tool (dbt). Both let a user take a modularised approach to writing SQL to manage data in DWHs as well as traditional RDBMSs.
Eventually, we settled on dbt. Despite being a managed solution with great BigQuery integration, Dataform couldn't handle an extensive schema of more than 1,000 columns (which was part of our initial use cases). Dbt could handle large schemas and provided more authentication options. The documentation and community support for dbt were also more mature than Dataform's, which settled our choice of dbt as the data management tool for our DWH.
Despite the data management benefits of dbt, deploying it in production environments can be challenging. First, how do you deploy dbt at all? Since dbt is a Python-based command-line tool, should you wrap it in an API-like interface? Or should you set up a VPS to host dbt with predefined SQL jobs?
At the same time, we already had an existing DWH on Google Cloud Platform (GCP). How could we interface dbt with our DWH in a secure way? And how could we do that both efficiently and relatively cheaply? In short, our dbt deployment had to be secure, cost-efficient, and fully manageable as code.
During our experimentation, running dbt on Cloud Run offered various benefits. Cloud Run runs containerised front- and backend applications without us having to maintain the underlying hardware infrastructure. To avoid running dbt as a 24/7 API (and paying for idle time), we opted for the 'Jobs' feature of Cloud Run. A Job uses the same managed hardware as a Cloud Run Service, but only runs a defined task and exits when done.
To adhere to our requirements and implement dbt on Cloud Run Jobs, we went back to the (literal) drawing board and designed a cloud architecture which enables us to develop with dbt locally and deploy our dbt SQL models as Cloud Run Jobs:
All the elements of the solution are built with Terraform where possible.
To limit the exposure of data and to lower the attack surface of the dbt implementation, we designed role-based access control (RBAC) based on the principle of least privilege (POLP) for the dbt jobs:
A YAML configuration defines dbt jobs with the following properties:
dbt model name
Cron schedule
BigQuery dataset roles - Per BigQuery dataset, one of the following roles is defined: data_editor, data_viewer, data_owner, or admin.
Terraform loops over the configuration, and creates a Cloud Run Job per dbt model. For every job, a unique service account is created.
In the BigQuery Terraform configuration, each service account is then granted its dataset roles on the specific dataset via a dataset IAM policy.
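As a simplified sketch of what such a per-dataset binding could look like in Terraform (our actual module builds a full dataset IAM policy from the YAML configuration; the resource and variable names below are illustrative):

# Illustrative only: grant a single dbt job's service account the
# data_editor role on one BigQuery dataset.
resource "google_bigquery_dataset_iam_member" "staging_data_editor" {
  project    = var.project_id
  dataset_id = google_bigquery_dataset.staging.dataset_id
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:${google_service_account.dbt_job.email}"
}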
How to set up a dbt project in a Git repository has been debated at length on your favourite coding forum. As always, the answer is 'it depends', and the developers at dbt Labs have written a very good piece about it. Despite the options presented by dbt, we decided to go for a two-repository approach.
One repository (or 'image' repository) will contain a Dockerfile which containerises the project and in turn installs the other repository as a dbt package. The second repository (or 'model' repository) will function as a regular dbt project, containing all the SQL models and the project configuration.
The reason why we're separating the SQL models from the containerisation is that it allows us to independently manage development cycles and dependencies for both the container and the SQL models. Managing our dbt SQL models in one repository enables focused version control and isolated testing environments for the dbt models themselves.
On the other hand, maintaining an optimised Dockerfile in a separate repository ensures that our image builds and deployment processes can be refined without interfering with SQL model development. This separation enhances overall maintainability, simplifies code updates, and reduces the complexity associated with combined repositories.
In our Terraform configuration, we create both repositories as follows:
resource "github_repository" "github_repo" {
name = var.name
description = var.description
homepage_url = "https://github.com/${var.managed_from_org}/${var.managed_from_repository}"
visibility = var.visibility
vulnerability_alerts = var.vulnerability_alerts
auto_init = var.auto_init
has_downloads = var.has_downloads
has_issues = var.has_issues
has_projects = var.has_projects
has_wiki = var.has_wiki
allow_rebase_merge = var.allow_rebase_merge
allow_squash_merge = var.allow_squash_merge
allow_merge_commit = var.allow_merge_commit
allow_auto_merge = var.allow_auto_merge
allow_update_branch = var.allow_update_branch
squash_merge_commit_title = var.squash_merge_commit_title
squash_merge_commit_message = var.squash_merge_commit_message
delete_branch_on_merge = var.delete_branch_on_merge
archive_on_destroy = true
lifecycle {
ignore_changes = [
allow_squash_merge,
allow_update_branch,
delete_branch_on_merge,
merge_commit_message,
merge_commit_title,
squash_merge_commit_message,
squash_merge_commit_title,
allow_auto_merge,
]
}
}
Also, you can add branch protection and configure further user and team permissions. More on the GitHub Terraform provider can be found here.
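As an illustration, branch protection with the same provider could look roughly like this (the review settings shown are examples, not our exact policy):

resource "github_branch_protection" "main" {
  repository_id = github_repository.github_repo.node_id
  pattern       = "main"

  enforce_admins = true

  required_pull_request_reviews {
    required_approving_review_count = 1
    dismiss_stale_reviews           = true
  }
}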
With the repositories in place, we can configure them to work with our Terraform design. Let's configure the SQL models repository first.
The dbt SQL model repository setup is rather simple. We start by initialising a Python Poetry project:
poetry init
During the Poetry setup, we install the following package dependency:
`dbt-bigquery = "^1.7.7"`
The dbt-bigquery dependency will also pull in the dbt-core Python package. With the package management in place, we create our dbt project inside the Poetry shell:
dbt init
This will provide us with the dbt file and folder structure. Let's configure the profiles.yml:
h2_data_warehouse_sql_models:
target: prod
outputs:
prod:
type: bigquery
method: oauth
project: ""
dataset: staging # Fill in the targeted dataset
threads: 5
location: EU
priority: interactive
retries: 2
The only change we added is the GCP_PROJECT_ID. This is an environment variable which is set at runtime of the Cloud Run Job. With this variable, we can deploy our SQL models in any containerised dbt environment, independently of the Google Cloud project where the data lives. More on the workings of the env_var function can be found here.
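For reference, the project value in the profile above is filled in with dbt's env_var function; a minimal sketch of that line, assuming the GCP_PROJECT_ID variable from our setup:

project: "{{ env_var('GCP_PROJECT_ID') }}"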
Another change we make to the 'vanilla' dbt project is the addition of a properties.yml file inside the 'models' folder. This file has two goals: documenting our SQL models and describing our DWH sources. This is done as follows:
version: 2
sources:
- name: "raw_trainings_data"
project: ""
dataset: "raw_data"
tables:
- name: "source_1"
identifier: "source_1"
- name: "source_2"
identifier: "source_2"
- name: "intermediate_data"
project: ""
dataset: "intermediate_data"
tables:
- name: "table_1"
identifier: "table_1"
- name: "table_2"
identifier: "table_2"
models:
- name: raw_trainings_data.source_1
description: "Unprocessed data from source 1."
columns:
- name: primary_id
description: "A unique identifier for the row."
We add the tables containing the unprocessed data (which will function as source data) and add documentation to the tables and schemas. That concludes the SQL model repository; next we set up the Docker image repository, starting with its packages.yml:
packages:
- git: "https://:@github.com/hnhdev/data-warehouse-dbt-models.git"
revision: ""
- package: dbt-labs/dbt_utils
version: 1.1.1
We use the dbt packages feature to include our SQL model repository in our Docker image repository, which will become the Cloud Run Job. We import our dbt SQL project as a Git package, as explained in the dbt documentation. The package configuration uses three environment variables:
GITHUB_USERNAME - the username of a dedicated 'service account' which provides access to GitHub. This non-human account controls access independently of team fluctuations.
SECRET_TOKEN - a token used to authenticate with the other GitHub repository. This enforces the POLP principle that only one account can access the data-warehouse-dbt-models repository. The token is retrieved from our secret manager when the 'job' runs; dbt uses it to authenticate with our SQL model repository and import it as a package.
DBT_REVISION - the repository commit hash or tag which represents the version of the package.
A small difference with the SQL models repository is the configuration of the dbt_project.yml file: we do not include anything underneath the models: key, as that is configured in the SQL models repository.
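To make that concrete, a minimal sketch of what the image repository's dbt_project.yml could look like - the project and profile names here are assumptions, not our exact configuration:

name: "h2_data_warehouse"
version: "1.0.0"
config-version: 2

profile: "h2_data_warehouse_sql_models"

model-paths: ["models"]
packages-install-path: "dbt_packages"

# Note: no models: section here - model configuration lives in the SQL models repository.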
The Dockerfile represents the 'job', which essentially runs the dbt run --select {insert dbt model name} command on GCP's Cloud Run. The Dockerfile is set up as a multi-stage build: a 'builder' part and a 'runtime' part.
FROM python:3.11-slim-buster AS builder
ARG POETRY_VERSION=1.6.1
ENV \
# Python doesn't create .pyc files.
PYTHONDONTWRITEBYTECODE=1 \
# Python prints application output/logs to the terminal.
PYTHONUNBUFFERED=1 \
# Python prints tracebacks of error to the terminal.
PYTHONFAULTHANDLER=1 \
# Turn off pip's cache.
PIP_NO_CACHE_DIR=1 \
# Disable pip's version check to fasten the build.
PIP_DISABLE_PIP_VERSION_CHECK=1 \
# Increase the timeout seconds for slow package downloads.
PIP_DEFAULT_TIMEOUT=100 \
# Poetry will not ask config questions.
POETRY_NO_INTERACTION=1 \
# Create Poetry virtualenv which mimics the local virtualenv
# and is a caching strategy.
POETRY_VIRTUALENVS_IN_PROJECT=1 \
POETRY_VIRTUALENVS_CREATE=1 \
# Specify Poetry's caching directory.
POETRY_CACHE_DIR=/tmp/poetry_cache
COPY pyproject.toml poetry.lock ./
RUN pip install poetry==${POETRY_VERSION} && \
poetry install && \
rm -rf $POETRY_CACHE_DIR
FROM python:3.11-slim-buster AS runtime
ENV VIRTUAL_ENV=/.venv \
PATH="/.venv/bin:$PATH" \
DBT_ENV_SECRET_PAT=""
RUN apt-get -y update && \
apt-get install --no-install-recommends --yes git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
useradd -r dbt
WORKDIR /h2_data_warehouse
RUN chown dbt /h2_data_warehouse
COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}
COPY --chown=dbt h2_data_warehouse vault.py /h2_data_warehouse/
USER dbt
# Exec-form entrypoint so that runtime arguments (the default CMD below, or the
# arguments supplied by the Cloud Run Job) are appended to the dbt run command.
ENTRYPOINT ["/bin/sh", "-c", "dbt deps && dbt run \"$@\"", "--"]
CMD ["--select"]
The 'builder' part creates a virtual environment with Poetry and installs all the necessary dependencies. The environment variables in the 'builder' part optimise the Poetry virtual environment (see the accompanying comments), and the rm -rf $POETRY_CACHE_DIR command removes the cache the virtual environment no longer needs.
In the 'runtime' part, the virtual environment is copied in, git is installed, and a dedicated non-root dbt user is created. The USER step changes the context in which subsequent commands run, starting with the ENTRYPOINT. The ENTRYPOINT installs the SQL model project as a dbt package with dbt deps and then runs the dbt run command, followed by the --select flag, enabling the selection of a specific dbt SQL model. With the ENTRYPOINT and CMD instructions in the Dockerfile, we achieve two goals: we can reproduce and run our dbt 'job' anywhere, and we can do so in line with our POLP principles. The last measures to secure our dbt setup are taken in our Terraform configuration.
To build the Dockerfile and use it as a container on Cloud Run Jobs, we are going to set up a CI/CD project. A dedicated CI/CD GCP project enables us to control the development and maintenance cycles of the Docker image repository without interference from the SQL model development cycles.
In our Terraform code, we create a new folder called data_build_tool_cicd (which will function as a Terraform module) and a new main.tf file. We initialise the following:
locals {
gcp_project_apis = [
"cloudbuild.googleapis.com",
"artifactregistry.googleapis.com",
"iam.googleapis.com"
]
}
resource "google_project" "project" {
name = var.name
project_id = var.project_id
org_id = var.org_id
folder_id = var.folder_id
billing_account = var.billing_account
deletion_policy = var.deletion_policy
labels = var.labels
auto_create_network = var.auto_create_network
}
resource "google_project_service" "activate_apis" {
for_each = toset(local.gcp_project_apis)
project = google_project.project.project_id
service = each.value
disable_on_destroy = false
disable_dependent_services = false
}
resource "google_service_account" "dbt_image_builder" {
project = google_project.project.project_id
account_id = "dbt-image-builder"
description = "Service account to build the DBT image within the Cloud Build CI/CD"
}
resource "google_project_iam_policy" "project_policy" {
policy_data = jsonencode({
bindings = concat(var.global_project_iam_members, [
{
members = ["serviceAccount:${google_service_account.dbt_image_builder.email}"]
role = "roles/artifactregistry.writer"
},
{
members = ["serviceAccount:${google_service_account.dbt_image_builder.email}"]
role = "roles/cloudbuild.builds.builder"
},
{
members = ["serviceAccount:${google_service_account.dbt_image_builder.email}"]
role = "roles/logging.logWriter"
}
])
})
project = google_project.project.project_id
}
These Terraform blocks create a dedicated GCP project with the required APIs activated, as well as an IAM policy assigned to a dedicated service account. In line with our POLP requirement, the service account is only granted the permissions it needs: it can push new images to the Artifact Registry, execute Cloud Build runs, and write logs.
Moving on to our image creation process:
resource "google_artifact_registry_repository" "dbt_job_image_repository" {
project = google_project.project.project_id
location = var.gcp_region
repository_id = "dbt-job-image-repository"
description = "Repository for storing the Data Build Tool (dbt) job images"
format = "DOCKER"
}
resource "google_cloudbuild_trigger" "dbt_job_image_builder" {
name = "build-and-store-dbt-job-image"
description = "Building the DBT docker image used in cloud run jobs"
project = google_project.project.project_id
location = var.gcp_region
  # The repository is manually connected to the GCP project in the GCP Console UI due to an authentication limitation.
github {
owner = "hunt_&_hackett"
name = "data-warehouse-dbt"
push {
branch = "^main$"
}
}
build {
step {
name = "gcr.io/cloud-builders/docker"
args = [
"build",
"-t",
"${var.gcp_region}-docker.pkg.dev/${google_project.project.project_id}/${google_artifact_registry_repository.dbt_job_image_repository.name}/data_build_tool_job:latest",
"."
]
}
images = [
"${var.gcp_region}-docker.pkg.dev/${google_project.project.project_id}/${google_artifact_registry_repository.dbt_job_image_repository.name}/data_build_tool_job:latest"
]
}
service_account = google_service_account.dbt_image_builder.id
trigger_template {
branch_name = "main"
}
}
We create an Artifact Registry repository to store our dbt job image, and a Cloud Build trigger which builds a new image whenever a change is pushed to the main branch.
With the process in place to build the dbt image, we can use it in Cloud Run Jobs. The jobs should also work with our existing data warehouse Terraform configuration, so we created a new TF module called dbt_cloud_run_job. The module starts by declaring the necessary local variables:
locals {
cloud_run_job_environment_variables = {
"GITHUB_USERNAME" = "data-build-tool"
"GCP_PROJECT_ID" = var.google_project.id
"DBT_REVISION" = "0.2.0"
}
}
These locals become the environment variables of the Cloud Run Job; we already covered what they do in the dbt repository sections. Moving on:
resource "google_service_account" "cloud_run_job_runtime_service_account" {
project = var.google_project.id
account_id = replace("dbt-job-${var.cloud_run_job_name}", "_", "-")
  description = "Runtime service account for Cloud Run Job '${var.cloud_run_job_name}'"
}
resource "google_service_account" "cloud_run_job_invoker_service_account" {
project = var.google_project.id
account_id = replace("dbt-invoke-${var.cloud_run_job_name}", "_", "-")
  description = "Service account to be used by Cloud Scheduler to invoke Cloud Run Job '${var.cloud_run_job_name}'"
}
data "google_iam_policy" "cloud_scheduler_job_invoker" {
binding {
role = "roles/run.invoker"
members = [
"serviceAccount:${google_service_account.cloud_run_job_invoker_service_account.email}"
]
}
}
The first service account is the runtime identity of the Cloud Run Job itself. The second is a dedicated service account for our Cloud Scheduler: in accordance with our POLP, we create an IAM policy so that it can only invoke this one dedicated Cloud Run Job. With this in place, we create the Cloud Run Job:
resource "google_cloud_run_v2_job" "cloud_run_job" {
name = replace(var.cloud_run_job_name, "_", "-")
project = var.google_project.id
location = var.gcp_region
template {
template {
service_account = google_service_account.cloud_run_job_runtime_service_account.email
containers {
image = "${var.cloud_run_job_image_url}:latest"
args = [var.dbt_model_name]
dynamic "env" {
for_each = local.cloud_run_job_environment_variables
content {
name = env.key
value = env.value
}
}
}
}
}
lifecycle {
ignore_changes = [
launch_stage,
]
}
}
The dbt Docker image in the Artifact Registry is provided as a Terraform output of our CI/CD Terraform module and as an input variable for the dbt Cloud Run Job module. Furthermore, the environment variables are dynamically configured in the resource.
The launch_stage of the Job is ignored so the Terraform configuration will not update/revert unnecessarily when the job is in production, making the TF management more predictable and stable. With the Cloud Run Job in place, we can create a specific Cloud Scheduler job for it:
resource "google_cloud_run_v2_job_iam_policy" "cloud_run_job_policy" {
project = google_cloud_run_v2_job.cloud_run_job.project
location = google_cloud_run_v2_job.cloud_run_job.location
name = google_cloud_run_v2_job.cloud_run_job.name
policy_data = data.google_iam_policy.cloud_scheduler_job_invoker.policy_data
}
locals {
cloud_run_job_uri = "https://${google_cloud_run_v2_job.cloud_run_job.location}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${var.google_project.number}/jobs/${google_cloud_run_v2_job.cloud_run_job.name}:run"
}
resource "google_cloud_scheduler_job" "cloud_scheduler_job" {
name = var.cloud_run_job_name
project = var.google_project.id
schedule = var.schedule
region = coalesce(var.scheduler_region, var.gcp_region)
description = "trigger ${google_cloud_run_v2_job.cloud_run_job.name}"
paused = var.scheduler_is_paused
time_zone = var.scheduler_time_zone
retry_config {
retry_count = var.scheduler_number_of_retries
}
http_target {
http_method = "POST"
uri = local.cloud_run_job_uri
oauth_token {
service_account_email = google_service_account.cloud_run_job_invoker_service_account.email
}
}
depends_on = [
google_cloud_run_v2_job_iam_policy.cloud_run_job_policy, # makes sure the service account can actually invoke the cloud run job
]
}
The Cloud Scheduler-specific service account is assigned the custom IAM policy and is used by the scheduler to invoke the job. The Cloud Scheduler itself is configured through the input variables provided when declaring the TF module.
Now that the dbt repositories, CI/CD module, and Cloud Run Job module are in place, we can connect them to our DWH Terraform configuration. Let's review our current Terraform setup:
terraform_project/
├── modules/
| └── dbt_cloud_run_job/
| └── data_build_tool_cicd/
| └── data_warehouse_gcp_project/
| └── bigquery_dataset/
├── main.tf
├── variables.tf
├── etc.
Both the data_warehouse_gcp_project/ and bigquery_dataset/ modules handle the creation of the DWH GCP project, its IAM policy, and the management of the BigQuery datasets. To include our dbt jobs in this setup, we configure the Cloud Run Job(s) in the data_warehouse_gcp_project/ module:
module "cloud_run_job" {
source = "../dbt_cloud_run_job"
for_each = var.data_build_tool_jobs
cloud_run_job_name = each.key
dbt_model_name = each.value.dbt_model_name
google_project = {
id = module.google_project.gcp_project_id
number = module.google_project.gcp_project_number
}
gcp_region = var.gcp_region
cloud_run_job_image_url = var.dbt_docker_image_url
schedule = each.value.schedule
scheduler_region = "europe-west1"
vault_mount = var.vault_mount
vault_parameters = var.vault_parameters
vault_secret_key = "personal_access_token"
}
This module declaration comes after the declaration of the CI/CD project, which provides the var.dbt_docker_image_url variable. The Cloud Run Jobs are created dynamically from the data_build_tool_jobs variable, which is based upon the data_build_tool_jobs.yaml file:
jobs:
job_one:
dbt_model_name: "raw_to_staging"
schedule: "0 8 * * *"
dataset_roles:
raw: "data_viewer"
staging: "data_editor"
job_two:
dbt_model_name: "staging_enhancement"
schedule: "10 8 * * *"
dataset_roles:
staging: "data_editor"
With this YAML file, we can maintain all the needed dbt model jobs for our BigQuery DWH in one central place. The file also provides the POLP-based IAM roles needed for each BigQuery dataset the job interacts with, as explained in our 'Solving the Challenge' section. This granularity lets us grant each dbt job's service account exactly the access roles it needs on exactly the BigQuery datasets it touches.
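The wiring between the YAML file and the Terraform variable can be as simple as a yamldecode call in the root configuration; a sketch, assuming the file lives next to the root module (other module inputs are abbreviated):

locals {
  data_build_tool_jobs = yamldecode(file("${path.module}/data_build_tool_jobs.yaml")).jobs
}

module "data_warehouse_gcp_project" {
  source               = "./modules/data_warehouse_gcp_project"
  data_build_tool_jobs = local.data_build_tool_jobs
  # ... other input variables omitted for brevity
}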
The YAML file concludes our Terraform configuration. The declaration of the dbt jobs ties together all the elements of the proposed design. In production, this setup enables us to go live with new SQL models relatively quickly by opening two pull requests: one on the SQL model repository and one on the YAML file inside the Terraform repository.
Annndd we're done! In conclusion, we successfully implemented a modular dbt setup on Google Cloud, optimising our SQL scripts for performance, cost-efficiency, and security. By utilising Cloud Run Jobs instead of a continuously running API server, we significantly reduced operational costs while ensuring that resources are only consumed when needed. The strict enforcement of the principle of least privilege in multiple parts of our infrastructure enhances security and reduces the attack surface available to threat actors.
The use of Terraform modules and dbt packages adds flexibility, allowing us to easily update components without disrupting downstream data tables or causing breaking changes. This modularity ensures that our data infrastructure is adaptable to potential new tools and evolving requirements.
Looking ahead, this setup positions us well to scale our datasets with dbt, automate SQL model deployments, and save time and costs through improved efficiency. Overall, we have achieved the following:
A Terraform-based method to deploy data build tool in our existing data warehouse configuration;
Strict separation of concerns: the CI/CD process, the dbt development cycle, and the DWH project are all managed independently;
All the GCP services and code enforce POLP where necessary;
An easy process to maintain and extend with new dbt SQL models.