Building a Data Platform on GCP: Dataflow + BigQuery + dbt + Airflow + Looker
How to build an end-to-end data platform on Google Cloud with Dataflow for ingestion, BigQuery for storage, dbt for transformation, Airflow for orchestration, and Looker for BI.
If you're building a data platform on Google Cloud, you've probably noticed the ecosystem has a clear "happy path." BigQuery for storage and compute, Looker for BI, and a constellation of tools that integrate tightly with both. It's one of the more cohesive cloud data stacks you can put together — assuming you know which pieces to pick and how they fit.
This guide walks through a production-grade GCP data stack: Cloud Dataflow for ingestion, BigQuery for storage, dbt for transformation, Airflow for orchestration, and Looker for analytics. We'll cover why each component earns its spot, how data flows through the system, and the GCP-specific gotchas that will save you time (and money).
Why Build on GCP?
Every cloud provider has a data warehouse. What makes GCP different is BigQuery's architecture and how tightly the rest of the ecosystem connects to it.
Serverless by default. BigQuery has no clusters to provision, no nodes to scale, no infrastructure to manage. You create a dataset, write SQL, and Google handles the rest. For teams that want to focus on data modeling instead of cluster tuning, this is a real advantage over Redshift or Databricks.
Separation of storage and compute. You pay for storage and queries independently. Store terabytes cheaply and only pay compute costs when you actually run queries. This matters more than it sounds — it means you can keep years of historical data without worrying about idle compute costs.
Native Looker integration. Google acquired Looker and has been weaving it into the GCP fabric. Looker connects to BigQuery with zero friction, and LookML's semantic layer approach pairs well with dbt's transformation philosophy. If BI is a core requirement, the BigQuery + Looker combination is hard to beat.
Strong ML and AI integration. BigQuery ML lets you train models directly in SQL. Vertex AI connects natively. If your data platform might grow into ML territory, the GCP ecosystem has the smoothest on-ramp.
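To give a flavor of what "models directly in SQL" means, here is a minimal BigQuery ML sketch. The dataset, table, and `churned` label column are hypothetical placeholders, not part of the stack built in this guide:

```sql
-- Hypothetical example: train a logistic regression entirely in SQL.
create or replace model marts.churn_model
options (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) as
select * from marts.customer_features;
```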
That said, GCP isn't perfect. The pricing model has surprises (we'll get to that), Cloud Composer is expensive, and some tooling is less mature than AWS equivalents. Let's look at the full stack.
The Stack Overview
Here's what each component does and why it's in the stack.
Cloud Dataflow (Ingestion)
Cloud Dataflow is Google's fully managed data processing service, built on Apache Beam. It handles both batch and streaming data pipelines, making it ideal for ingesting data from databases, APIs, Pub/Sub topics, Cloud Storage, and other sources into BigQuery.
Why Dataflow? It's GCP-native, serverless, and scales automatically. You define your pipeline logic using the Apache Beam SDK (Python or Java), and Dataflow handles provisioning, autoscaling, and monitoring. There's no infrastructure to manage — you submit a job and Google runs it.
Dataflow is particularly strong for:
- Database replication: Use Dataflow templates to stream changes from Cloud SQL, MySQL, or PostgreSQL into BigQuery via change data capture (CDC)
- File ingestion: Process CSV, JSON, Avro, or Parquet files landing in Cloud Storage and load them into BigQuery
- Streaming ingestion: Read from Pub/Sub topics and write to BigQuery in near-real-time
- API ingestion: Write custom Beam pipelines to pull data from REST APIs and land it in BigQuery
Google provides a library of pre-built Dataflow templates for common use cases (Cloud Storage to BigQuery, Pub/Sub to BigQuery, JDBC to BigQuery), so you don't always need to write pipeline code from scratch.
Dataflow lands data in a "raw" BigQuery dataset — untransformed, source-mirrored tables that dbt will pick up downstream.
BigQuery (Storage and Compute)
BigQuery is the center of this stack. It stores your raw data, runs your dbt transformations, and serves queries to Looker. The key architectural points:
- On-demand pricing: $6.25 per TB scanned. Good for development and smaller workloads.
- Flat-rate slots: Reserved compute capacity at a fixed monthly cost. Better for production workloads with predictable query volumes.
- Partitioning and clustering: Critical for controlling costs and query performance. Partition by date, cluster by your most-filtered columns.
- Datasets as schemas: BigQuery uses "datasets" where other warehouses use "schemas." Your data platform will typically have `raw`, `staging`, and `marts` datasets.
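The partitioning and clustering point above translates directly into DDL. A minimal sketch, with placeholder table and column names:

```sql
-- Partition by date, cluster by the columns queries filter on most.
create table raw.events (
  event_ts   timestamp,
  user_id    string,
  event_type string
)
partition by date(event_ts)
cluster by user_id, event_type;
```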
dbt (Transformation)
dbt handles the "T" in ELT. It takes the raw data that Dataflow landed in BigQuery, applies SQL transformations organized in a dependency graph, and produces clean, tested, documented tables and views following the medallion architecture:
- Bronze (staging) models clean and rename raw Dataflow tables
- Silver (intermediate) models join and aggregate across sources
- Gold (marts) models serve specific business domains (these are what Looker connects to)
dbt is particularly strong on BigQuery because of native support for partitioning, clustering, merge strategies for incremental models, and BigQuery-specific materializations. More on this in the setup section.
Airflow (Orchestration)
Airflow coordinates the pipeline. It triggers Dataflow jobs, waits for completion, kicks off dbt runs, and handles retries and alerting. Without an orchestrator, you're running things manually or hoping cron jobs don't silently fail.
On GCP, you have two options for Airflow:
- Cloud Composer: Google's managed Airflow service. Zero infrastructure management, but expensive ($300+/month minimum for the smallest environment).
- Self-hosted: Run Airflow on GKE or Compute Engine. More control, lower cost at scale, but significant operational overhead.
Looker (BI and Analytics)
Looker is Google's BI platform. Unlike traditional BI tools that connect directly to tables, Looker uses LookML — a modeling language that defines a semantic layer on top of your data. This pairs naturally with dbt's approach: dbt builds clean Gold (mart) tables, and LookML defines how business users explore them.
Looker connects natively to BigQuery, supports BigQuery's nested and repeated fields, and can push down complex queries without performance hacks. If you're already on GCP, it's the most integrated BI option available.
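To make the dbt-to-LookML pairing concrete, here is a minimal sketch of a LookML view over a hypothetical Gold table (`marts.fct_orders`; the project name and field names are assumptions):

```lookml
view: fct_orders {
  sql_table_name: `your-project.marts.fct_orders` ;;

  dimension: order_id {
    primary_key: yes
    sql: ${TABLE}.order_id ;;
  }

  dimension_group: order {
    type: time
    timeframes: [date, week, month, quarter]
    sql: ${TABLE}.order_date ;;
  }

  measure: total_revenue {
    type: sum
    sql: ${TABLE}.total_amount ;;
  }
}
```

One view like this per dbt Gold model keeps the two layers easy to reason about: dbt owns the physical table, LookML owns how analysts slice it.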
How Data Flows
The pipeline follows a clear path:
Dataflow → BigQuery (raw) → dbt (Bronze → Silver → Gold) → Looker
In more detail:
1. Dataflow ingests source data into the `raw` dataset in BigQuery. Templates or custom Beam pipelines handle extraction from databases, Cloud Storage, Pub/Sub, and APIs.
2. dbt Bronze (staging) models clean and rename raw data. One staging model per source table. Light transformations: type casting, renaming, filtering deleted records.
3. dbt Silver (intermediate) models join and aggregate Bronze models. Business logic lives here.
4. dbt Gold (marts) models are the final, business-ready tables. One mart per domain or team: `marts.finance_revenue`, `marts.marketing_attribution`, etc.
5. Looker connects to the Gold dataset. LookML explores map to Gold tables, and business users build dashboards and reports.
Airflow orchestrates steps 1-4: trigger Dataflow job, wait for completion, run dbt build.
Setting Up the Pipeline
BigQuery Dataset Structure
Create three datasets to separate your data layers:
```bash
# Raw data from Dataflow
bq mk --dataset --location=US your-project:raw

# dbt Bronze and Silver models
bq mk --dataset --location=US your-project:staging

# dbt Gold models (final business tables)
bq mk --dataset --location=US your-project:marts
```
Make sure all datasets are in the same region. Cross-region queries in BigQuery either fail or incur egress charges.
dbt Project Configuration
Your profiles.yml connects dbt to BigQuery using a service account:
```yaml
# ~/.dbt/profiles.yml
my_gcp_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      project: your-gcp-project-id
      dataset: staging  # default dataset for models
      threads: 8
      keyfile: /path/to/service-account-key.json
      location: US
      timeout_seconds: 300
      priority: interactive
      retries: 1
```
For production environments (CI/CD, Docker, orchestrators), use the service-account-json method with environment variables instead of a key file on disk:
```yaml
my_gcp_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account-json
      project: "{{ env_var('GCP_PROJECT_ID') }}"
      dataset: staging
      threads: 8
      location: US
      timeout_seconds: 300
      keyfile_json:
        type: service_account
        project_id: "{{ env_var('GCP_PROJECT_ID') }}"
        private_key_id: "{{ env_var('GCP_PRIVATE_KEY_ID') }}"
        private_key: "{{ env_var('GCP_PRIVATE_KEY') }}"
        client_email: "{{ env_var('GCP_CLIENT_EMAIL') }}"
        client_id: "{{ env_var('GCP_CLIENT_ID') }}"
        auth_uri: https://accounts.google.com/o/oauth2/auth
        token_uri: https://oauth2.googleapis.com/token
        auth_provider_x509_cert_url: https://www.googleapis.com/oauth2/v1/certs
        client_x509_cert_url: "{{ env_var('GCP_CERT_URL') }}"
```
dbt Model with BigQuery-Specific Config
BigQuery has features that other warehouses don't — partitioning and clustering are the big ones. Here's a Gold model that takes advantage of both:
```sql
-- models/marts/fct_orders.sql
{{
  config(
    materialized='incremental',
    partition_by={
      "field": "order_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["customer_id", "order_status"],
    incremental_strategy='merge',
    unique_key='order_id'
  )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

payments as (
    select * from {{ ref('stg_payments') }}
)

select
    o.order_id,
    o.customer_id,
    o.order_date,
    o.order_status,
    p.total_amount,
    p.payment_method,
    current_timestamp() as loaded_at
from orders o
left join payments p on o.order_id = p.order_id

{% if is_incremental() %}
where o.order_date >= (select max(order_date) from {{ this }})
{% endif %}
```
The partition_by config tells BigQuery to physically partition the table by order_date. This is critical for cost control — queries that filter on order_date only scan the relevant partitions instead of the entire table. The cluster_by config sorts data within partitions by customer_id and order_status, improving query performance for filters on those columns.
The incremental_strategy='merge' uses BigQuery's native MERGE statement, which is the most efficient approach for upserts on partitioned tables.
Directing Models to Different Datasets
In your dbt_project.yml, route models to the correct BigQuery datasets:
```yaml
# dbt_project.yml
models:
  my_gcp_project:
    staging:
      +schema: staging
      +materialized: view
    intermediate:
      +schema: staging
      +materialized: view
    marts:
      +schema: marts
      +materialized: table
```
Airflow DAG Example
Here's a simplified DAG that coordinates Dataflow and dbt. This uses the Dataflow Airflow provider to launch a template job and a BashOperator for dbt:
```python
# dags/gcp_data_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartFlexTemplateOperator,
)

default_args = {
    "owner": "data-team",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-team@company.com"],
}

with DAG(
    "gcp_data_pipeline",
    default_args=default_args,
    schedule_interval="0 6 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    # Trigger the Dataflow ingestion job and block until it finishes
    run_dataflow = DataflowStartFlexTemplateOperator(
        task_id="run_dataflow_ingestion",
        project_id="your-gcp-project-id",
        location="us-central1",
        body={
            "launchParameter": {
                "jobName": "ingest-source-data",
                "containerSpecGcsPath": (
                    "gs://your-bucket/templates/ingestion-template.json"
                ),
                "parameters": {
                    "outputTable": "your-project:raw.source_table",
                },
            }
        },
        wait_until_finished=True,
    )

    # Run dbt once the raw data has landed
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/my-project && dbt build --target prod",
    )

    run_dataflow >> dbt_build
```
This DAG runs daily at 6 AM: launch the Dataflow ingestion job, wait for it to complete, then run dbt build. In practice, you'd likely have multiple Dataflow jobs for different sources, add Slack notifications on failure, and set SLA timeouts.
GCP-Specific Considerations
BigQuery Pricing: On-Demand vs. Flat-Rate Slots
This is the question every BigQuery team hits eventually.
On-demand pricing charges $6.25 per TB scanned. It's simple, requires no commitment, and works well when your query volume is low or unpredictable. But it has a catch: costs scale with data volume, not query complexity. A poorly written query that scans a 10 TB unpartitioned table costs $62.50 every time it runs.
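The arithmetic is worth internalizing. A back-of-the-envelope sketch in pure Python, using this article's $6.25/TB figure (real billing also rounds each referenced table up to a 10 MB minimum, which this ignores):

```python
def scan_cost_usd(bytes_scanned: float, price_per_tb: float = 6.25) -> float:
    """Estimate BigQuery on-demand cost for a query scanning `bytes_scanned` bytes."""
    return bytes_scanned / 1e12 * price_per_tb

# Full scan of a 10 TB unpartitioned table: $62.50 every time it runs.
full_scan = scan_cost_usd(10e12)

# Same table partitioned by day, querying one day out of a year of data:
pruned = scan_cost_usd(10e12 / 365)
```

The gap between those two numbers is why partitioning comes up in nearly every section of this guide.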
Flat-rate slots give you reserved compute capacity (100 slots minimum, roughly $2,000/month on a monthly commitment). Costs are fixed regardless of data scanned. This makes costs predictable and eliminates the "someone ran a bad query and burned through our budget" problem.
When to switch: If your monthly on-demand bill consistently exceeds $2,000-3,000, flat-rate slots are probably cheaper. If you're running dbt on a schedule with large incremental models, the predictability alone might be worth it. Start on-demand, monitor with BigQuery's INFORMATION_SCHEMA.JOBS view, and switch when the numbers make sense.
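One way to do that monitoring, sketched as a query against job metadata (adjust the `region-us` qualifier to your dataset location; the $6.25/TB multiplier follows this article's pricing figure):

```sql
-- Rough on-demand spend over the last 30 days, by user.
select
  user_email,
  round(sum(total_bytes_billed) / pow(10, 12) * 6.25, 2) as est_cost_usd
from `region-us`.INFORMATION_SCHEMA.JOBS
where creation_time >= timestamp_sub(current_timestamp(), interval 30 day)
  and job_type = 'QUERY'
group by user_email
order by est_cost_usd desc;
```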
Service Account Permissions and Workload Identity
For production pipelines, your dbt runner needs a service account with:
- `roles/bigquery.dataEditor` on your target datasets (staging, marts)
- `roles/bigquery.dataViewer` on source datasets (raw)
- `roles/bigquery.jobUser` at the project level
For Dataflow jobs, the service account also needs:
- `roles/dataflow.worker` at the project level
- `roles/storage.objectViewer` on any Cloud Storage buckets used for templates or staging
If you're running on GKE (including Cloud Composer), use Workload Identity Federation instead of JSON key files. It maps Kubernetes service accounts to GCP service accounts without long-lived credentials — no key rotation, no leaked secrets.
```bash
# Link a Kubernetes service account to a GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  dbt-runner@your-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:your-project.svc.id.goog[airflow/dbt-runner]"
```
Dataset Locations
BigQuery datasets are region-locked. All datasets in your pipeline must be in the same region, or queries across them will fail. Pick a region early — US (multi-region) for most North American teams, EU (multi-region) for GDPR compliance — and stick with it.
Make sure your Dataflow jobs write to BigQuery datasets in the same region. A location mismatch between the Dataflow job region and the BigQuery dataset region will cause failures.
BigQuery-Specific dbt Features
A few things that work differently (or better) on BigQuery:
- Merge strategy for incremental models: BigQuery's `MERGE` is efficient but has a quirk — the merge key must not produce duplicate matches on the target table, or the query fails. Make sure your `unique_key` is actually unique.
- Partition pruning: dbt's `is_incremental()` filter should align with your partition column. If your table is partitioned by `order_date` but your incremental filter is on `updated_at`, BigQuery still scans all partitions.
- Require partition filter: You can set `require_partition_filter: true` on BigQuery tables to prevent full-table scans. Useful for large tables that should always be queried with a date filter.
- Nested and repeated fields: BigQuery supports `STRUCT` and `ARRAY` types natively. dbt can work with these, but they complicate your models. Flatten them in Bronze (staging) if downstream tools (including Looker) expect flat schemas.
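Flattening in staging usually means `unnest`. A minimal sketch of a hypothetical staging model (the `raw.orders` source and its `line_items` repeated field are placeholders) that explodes one row per array element:

```sql
-- models/staging/stg_order_items.sql (hypothetical source and column names)
select
  o.order_id,
  item.sku,
  item.quantity,
  item.unit_price
from {{ source('raw', 'orders') }} as o,
  unnest(o.line_items) as item
```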
Common Pitfalls
BigQuery slot contention. On flat-rate pricing, all queries share the same pool of slots. If your dbt run uses 8 threads and each thread runs a heavy query, you might starve out Looker dashboards or ad-hoc analyst queries running at the same time. Use BigQuery Reservations to assign dedicated slot pools for different workloads (e.g., "dbt-production" gets 200 slots, "analytics" gets 100 slots).
Scan costs on unpartitioned tables. This is the number-one BigQuery cost trap. A single unpartitioned table with a few TB of data will cost you $6+ per query on on-demand pricing. Always partition fact tables by date and cluster by frequently-filtered dimensions. Add require_partition_filter: true for extra protection.
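In dbt, that protection is one line of model config. A sketch, assuming dbt-bigquery's `require_partition_filter` option (column names are placeholders):

```sql
{{ config(
    materialized='incremental',
    partition_by={"field": "order_date", "data_type": "date"},
    require_partition_filter=true
) }}
```

With this set, any query against the table that omits a filter on `order_date` is rejected outright instead of silently scanning everything.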
Looker LookML learning curve. LookML is powerful but it's its own language with its own IDE (Looker IDE). Data engineers comfortable with dbt often find the transition jarring. Plan for a ramp-up period, and consider having your dbt and LookML models mirror each other in structure — one LookML view per dbt Gold model.
Cloud Composer pricing. Google's managed Airflow service starts at roughly $300-400/month for the smallest composer-1 environment, and a production-grade composer-2 environment easily runs $500-1,000/month. That's before you add workers for heavier workloads. For teams that only need Airflow to run dbt, this is hard to justify.
Dataflow cold starts and costs. Dataflow jobs have a startup time of 2-5 minutes while workers are provisioned. For small, frequent jobs this overhead adds up. Consider using Dataflow Prime for faster autoscaling, or batch multiple small sources into a single pipeline. Dataflow charges per vCPU-hour and per GB of memory, so right-size your worker machine types and set maxNumWorkers to prevent runaway costs.
The Orchestration Challenge
Every component in this stack is either fully managed or straightforward to set up — except orchestration.
Dataflow is serverless. BigQuery is serverless. Looker is a SaaS platform. dbt is just SQL files in a Git repo. But something needs to coordinate the pipeline: trigger Dataflow jobs, run transformations on a schedule, handle failures, and alert when things break.
Cloud Composer solves this, but the cost is steep. At $300-400/month minimum — and realistically $500+ for a production setup — it's often the most expensive line item in the stack for small to mid-size teams. And that's Google's managed version. Self-hosting Airflow on GKE means managing the scheduler, webserver, workers, metadata database, and keeping everything updated.
For teams whose primary orchestration need is running dbt on a schedule after Dataflow ingestion, a full Airflow deployment feels disproportionate.
A Simpler Way to Handle Orchestration
We built ModelDock to solve exactly this problem. It runs Airflow under the hood — same reliability, same battle-tested scheduler — but you never touch it.
Connect your dbt Git repository, enter your BigQuery service account credentials (encrypted with AES-256-GCM), set a cron schedule, and you're running in production. Your dbt project executes in an isolated container with full run logs and artifact storage. If a run fails, you know about it.
No Cloud Composer bill. No Airflow infrastructure to manage. No DAGs to write.
ModelDock handles the orchestration layer so you can focus on the parts of the stack that actually differentiate your data platform — the models, the business logic, and the dashboards your team relies on.
Free during open beta at modeldock.run.