
Building a Data Platform on AWS: Glue + Redshift + dbt + Airflow + QuickSight

How to build an end-to-end data platform on AWS with Glue for ingestion, Redshift for storage, dbt for transformation, Airflow for orchestration, and QuickSight for BI.

ModelDock Team · February 17, 2026 · 14 min read

If you work in data, chances are your company already has an AWS account. Probably several. AWS is where most enterprises already live, and when it comes time to build a data platform, staying in the ecosystem makes a lot of practical sense. IAM roles flow naturally, VPC networking is already configured, and procurement has already approved the vendor.

That doesn't mean building a data pipeline architecture on AWS is simple. The number of services available is overwhelming, and it's easy to over-engineer things or pick the wrong tool for the job. This guide lays out a proven, production-ready ELT stack built from five components: AWS Glue for ingestion, Amazon Redshift for storage, dbt for transformation, Apache Airflow for orchestration, and Amazon QuickSight for BI.

Let's walk through each layer, how data flows between them, and where things get tricky.

Why Build on AWS?

Three reasons stand out:

You're already there. Most enterprises have AWS accounts with established billing, networking, and security controls. Adding a data platform to an existing AWS footprint is far easier than introducing a new cloud provider.

The ecosystem is deep. AWS has services for every part of the data stack. Whether you need streaming (Kinesis), object storage (S3), or managed databases (RDS), there's a native option. More importantly, these services integrate with each other through IAM, VPCs, and CloudWatch without requiring third-party glue.

Talent pool. AWS is the most widely used cloud provider. Your next hire is more likely to know Redshift than they are to know Synapse or AlloyDB. That matters when you're building a team.

The trade-off is cost complexity. AWS pricing is notoriously difficult to predict, and a data platform touches multiple services that each bill differently. We'll cover that later.

The Stack Overview

Here's what each component does and why it's in the stack.

AWS Glue / DMS (Ingestion)

AWS Glue is a serverless ETL service that handles data ingestion. You write Python or Spark-based jobs that extract data from sources (S3, RDS, APIs, JDBC databases) and load it into your warehouse.

Glue also includes crawlers -- automated processes that scan your data sources, infer schemas, and populate the Glue Data Catalog. This is genuinely useful when you're working with semi-structured data in S3 (JSON, Parquet, CSV). The crawler figures out column names, types, and partitions so you don't have to define everything manually.
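Registering a crawler is a couple of API calls. Here's a minimal boto3 sketch; the crawler name, IAM role ARN, catalog database, and bucket path are all hypothetical placeholders:

```python
# Sketch: registering and running a Glue crawler over an S3 prefix.
# All names/ARNs below are hypothetical -- substitute your own.

def crawler_config(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the kwargs for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,                      # Glue Data Catalog database
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Keep the catalog in sync when source columns change
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }

def create_and_run_crawler() -> None:
    """One-time setup plus kick-off. Requires AWS credentials; call from a deploy script."""
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    cfg = crawler_config(
        name="raw-source-a-crawler",
        role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        database="raw",
        s3_path="s3://company-data-lake/raw/source_a/",
    )
    glue.create_crawler(**cfg)            # one-time registration
    glue.start_crawler(Name=cfg["Name"])  # scan, infer schema, populate the catalog
```

After the crawler finishes, the inferred tables show up in the Glue Data Catalog and can be queried from Athena or referenced in Glue jobs.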

For database-to-database replication, AWS DMS (Database Migration Service) is often a better fit. Despite the name, DMS isn't just for migrations -- it supports continuous replication (CDC) from sources like MySQL, PostgreSQL, Oracle, and SQL Server into Redshift. If your primary use case is syncing operational databases into your warehouse, DMS is simpler than writing Glue jobs.

When to use which:

  • Glue: S3-based ingestion, complex transformations during load, multiple heterogeneous sources
  • DMS: Database replication with change data capture, simpler setup for database-to-database pipelines
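Either way, the kick-off is programmatic. A hedged sketch of starting each from boto3 (the job name and replication task ARN are hypothetical):

```python
# Sketch: triggering Glue vs. DMS ingestion. Names/ARNs are hypothetical.

def dms_start_type(first_run: bool) -> str:
    """DMS distinguishes a fresh full load + CDC from resuming an existing task."""
    return "start-replication" if first_run else "resume-processing"

def trigger_ingestion() -> None:
    """Requires AWS credentials; call from your orchestrator."""
    import boto3

    # Glue: run a job you've already defined
    glue = boto3.client("glue", region_name="us-east-1")
    glue.start_job_run(JobName="ingest-source-data")

    # DMS: start (or resume) a CDC replication task
    dms = boto3.client("dms", region_name="us-east-1")
    dms.start_replication_task(
        ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",
        StartReplicationTaskType=dms_start_type(first_run=True),
    )
```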

Amazon Redshift (Storage & Compute)

Redshift is AWS's columnar data warehouse. It's been around since 2012 and is one of the most mature cloud warehouses available. Your two main deployment options:

Redshift Serverless -- You specify a base capacity in RPUs (Redshift Processing Units), and Redshift scales compute automatically. No cluster management, no node selection. You pay for the compute you use (measured in RPU-hours) plus storage. This is the right choice for most new projects.

Provisioned clusters -- You pick node types (RA3, DC2) and cluster sizes. RA3 nodes separate compute from storage (data lives in S3-backed managed storage), while DC2 nodes use local SSD storage. Provisioned clusters make sense if you have predictable, sustained workloads and want to commit to reserved instances for cost savings.

For a Redshift + dbt setup, Redshift Serverless is almost always the better starting point. It eliminates the capacity-planning guesswork and lets you focus on building models instead of sizing clusters.

dbt (Transformation)

dbt handles the "T" in ELT. Once raw data lands in Redshift (via Glue or DMS), dbt transforms it into clean, tested, documented models using SQL.

The value proposition is the same regardless of your warehouse: version-controlled SQL, automated testing, lineage tracking, and a clean separation between raw data and business logic. With Redshift specifically, dbt handles materializations (tables, views, incremental models) and can leverage Redshift-specific features like sort keys and distribution keys through model configs.

Airflow (Orchestration)

Something needs to coordinate all of this. Glue jobs need to finish before dbt runs. dbt needs to finish before QuickSight refreshes. That's where Airflow comes in.

On AWS, you have two options:

Amazon MWAA (Managed Workflows for Apache Airflow) -- AWS runs Airflow for you. You get a managed Airflow environment with the webserver, scheduler, and workers handled by AWS. You just deploy DAGs to an S3 bucket. The downside is cost: MWAA starts at roughly $0.49/hour for the smallest environment, which works out to about $350/month before you run a single DAG.

Self-hosted Airflow -- Run Airflow yourself on EC2, ECS, or EKS. Full control, lower base cost, but you're responsible for everything: upgrades, scaling, monitoring, high availability. This is a serious operational commitment.

Amazon QuickSight (BI)

QuickSight is AWS's native BI tool. It connects directly to Redshift, has a built-in caching engine called SPICE (Super-fast, Parallel, In-memory Calculation Engine), and uses pay-per-session pricing for readers.

The main advantages are tight AWS integration (IAM auth, VPC connectivity) and a pricing model that works well if you have many casual users who look at dashboards occasionally. The main disadvantage is that QuickSight is less feature-rich than Looker, Tableau, or Power BI. For many teams, though, it's good enough -- and being native to AWS simplifies networking and security significantly.

How Data Flows Through the Stack

Here's the end-to-end flow:

S3 Landing Zone
|
v
AWS Glue Job (or DMS)
| Extracts, cleans, loads raw data
v
Redshift: raw schema
|
v
dbt: Bronze (staging) models
| Cleans, renames, casts types
v
dbt: Silver (intermediate) models
| Business logic, joins, aggregations
v
dbt: Gold (mart) models
| Final tables for consumption
v
QuickSight Dashboards
| SPICE cache or direct query
v
Business Users

The key principle is that raw data lands untransformed in Redshift. All business logic lives in dbt, where it's version-controlled and tested. QuickSight reads only from Gold (mart) models -- never from raw tables.

This separation means you can change business logic by modifying SQL in your dbt project, run dbt build, and the dashboards update automatically (after a SPICE refresh, if you're using SPICE).

Setting Up the Pipeline

Redshift Serverless Setup

Create a Redshift Serverless workgroup through the AWS Console or CLI:

aws redshift-serverless create-namespace \
  --namespace-name analytics \
  --admin-username admin \
  --admin-user-password 'YourStrongPassword' \
  --db-name analytics

aws redshift-serverless create-workgroup \
  --workgroup-name analytics-wg \
  --namespace-name analytics \
  --base-capacity 32

A base capacity of 32 RPUs is a reasonable starting point (the minimum base capacity is 8 RPUs). Redshift Serverless scales up automatically for heavier queries and scales back down when idle.

Create the schemas dbt will use:

CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS staging;
CREATE SCHEMA IF NOT EXISTS marts;

dbt Project Pointing at Redshift

Install the Redshift adapter:

pip install dbt-redshift

Configure profiles.yml. For AWS environments, IAM authentication is preferred over username/password:

# ~/.dbt/profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: redshift
      method: iam
      cluster_id: null  # Not needed for Serverless
      host: "analytics-wg.123456789.us-east-1.redshift-serverless.amazonaws.com"
      port: 5439
      dbname: analytics
      schema: staging
      threads: 4
      iam_profile: default  # AWS CLI profile to use
      region: us-east-1
    prod:
      type: redshift
      method: iam
      host: "analytics-wg.123456789.us-east-1.redshift-serverless.amazonaws.com"
      port: 5439
      dbname: analytics
      schema: staging
      threads: 8
      iam_profile: production
      region: us-east-1

If you're not using IAM auth, the password-based config looks like this:

dev:
  type: redshift
  host: "analytics-wg.123456789.us-east-1.redshift-serverless.amazonaws.com"
  port: 5439
  user: "{{ env_var('REDSHIFT_USER') }}"
  password: "{{ env_var('REDSHIFT_PASSWORD') }}"
  dbname: analytics
  schema: staging
  threads: 4

Verify the connection:

dbt debug

Airflow DAG Example

Here's a simplified DAG that coordinates the full pipeline -- trigger a Glue job, wait for it to complete, run dbt, and refresh a QuickSight dataset:

# dags/aws_data_pipeline.py
from datetime import datetime, timedelta
import uuid

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.quicksight import QuickSightCreateIngestionOperator

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-team@company.com'],
}

with DAG(
    'aws_data_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
    tags=['production', 'dbt', 'glue'],
) as dag:

    # Step 1: Trigger Glue ingestion job and wait for it to finish
    run_glue_job = GlueJobOperator(
        task_id='run_glue_ingestion',
        job_name='ingest-source-data',
        region_name='us-east-1',
        aws_conn_id='aws_default',
        wait_for_completion=True,
        verbose=True,
    )

    # Step 2: Run dbt build
    dbt_build = BashOperator(
        task_id='dbt_build',
        bash_command=(
            'cd /opt/dbt/my-project && '
            'dbt deps && '
            'dbt build --target prod --profiles-dir /opt/dbt/profiles'
        ),
    )

    # Step 3: Refresh QuickSight SPICE dataset
    refresh_quicksight = QuickSightCreateIngestionOperator(
        task_id='refresh_quicksight',
        data_set_id='your-dataset-id',
        ingestion_id=str(uuid.uuid4()),
        aws_conn_id='aws_default',
    )

    run_glue_job >> dbt_build >> refresh_quicksight

This is a simplified version. In practice, you'd likely have multiple Glue jobs for different sources, error handling with on-failure callbacks, and possibly a Slack or SNS notification step.
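A failure callback is the usual starting point for that error handling. A sketch, assuming a hypothetical SNS topic ARN; the `context` dict is what Airflow passes to `on_failure_callback`:

```python
# Sketch of an on_failure_callback publishing to SNS. The topic ARN is
# hypothetical; `context` is the dict Airflow hands to failure callbacks.

def failure_message(context: dict) -> str:
    """Format a human-readable alert from an Airflow task context."""
    ti = context["task_instance"]
    return (
        f"Pipeline failure: {ti.dag_id}.{ti.task_id} "
        f"(run {context['run_id']}) -- see logs: {ti.log_url}"
    )

def notify_failure(context: dict) -> None:
    """Publish the alert. Requires AWS credentials at runtime."""
    import boto3  # local import so the DAG file parses even without boto3

    sns = boto3.client("sns", region_name="us-east-1")
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:data-alerts",
        Subject="Airflow task failed",
        Message=failure_message(context),
    )

# Wire it up via default_args:
#   default_args = {..., 'on_failure_callback': notify_failure}
```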

AWS-Specific Considerations

IAM Roles vs Username/Password

For Redshift authentication, prefer IAM roles over static credentials whenever possible. IAM auth uses temporary credentials that rotate automatically, eliminating the risk of leaked passwords.

In practice, this means:

  • Your dbt runner (whether that's Airflow, an EC2 instance, or a container) assumes an IAM role
  • The role has redshift-serverless:GetCredentials permission
  • dbt's method: iam config uses the role's temporary credentials

This is cleaner and more secure, but it requires your dbt runtime to be running inside AWS (EC2, ECS, Lambda, etc.). If you're running dbt locally or from an external CI system, you'll need static credentials or AWS SSO.
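Roughly, this is what `method: iam` does under the hood: exchange the runner's IAM identity for short-lived database credentials. A sketch against the workgroup and database created earlier (adjust names and region for your environment):

```python
# Sketch: fetching temporary Redshift Serverless credentials via IAM.

def credentials_request(workgroup: str, db: str, ttl: int = 900) -> dict:
    """kwargs for the redshift-serverless GetCredentials call."""
    return {"workgroupName": workgroup, "dbName": db, "durationSeconds": ttl}

def fetch_temp_credentials() -> dict:
    """Requires an AWS identity (role or profile) with GetCredentials permission."""
    import boto3

    client = boto3.client("redshift-serverless", region_name="us-east-1")
    creds = client.get_credentials(**credentials_request("analytics-wg", "analytics"))
    # creds["dbUser"] / creds["dbPassword"] work with any Postgres-protocol
    # driver pointed at the workgroup endpoint on port 5439, and expire
    # automatically -- no static password to leak.
    return creds
```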

VPC Networking

Redshift Serverless runs inside a VPC. This means anything that needs to connect to it -- Glue jobs, Airflow workers, dbt runners -- needs network access to that VPC.

For Glue, this means creating a Glue Connection with VPC configuration (subnet, security group) so Glue jobs can reach Redshift. This is a common stumbling block: the Glue job runs fine in isolation but fails when trying to connect to Redshift because of missing VPC config.

For Airflow on MWAA, the MWAA environment needs to be in the same VPC as Redshift (or have VPC peering). For self-hosted Airflow, same principle -- your workers need to reach Redshift's endpoint.

Make sure your security groups allow inbound traffic on port 5439 from the relevant CIDR ranges or security groups.
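Security-group-to-security-group rules are usually cleaner than CIDR ranges here. A boto3 sketch with hypothetical security group IDs:

```python
# Sketch: allow the dbt/Airflow runner's security group to reach Redshift
# on port 5439. Both security group IDs below are hypothetical.

def redshift_ingress_rule(source_sg: str) -> dict:
    """IpPermissions entry granting port 5439 access from another SG."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "UserIdGroupPairs": [{"GroupId": source_sg}],  # SG-to-SG, no CIDR needed
    }

def open_redshift_port() -> None:
    """Requires AWS credentials with EC2 permissions."""
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.authorize_security_group_ingress(
        GroupId="sg-0redshift0example0",  # Redshift Serverless workgroup SG
        IpPermissions=[redshift_ingress_rule("sg-0airflow0example0")],
    )
```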

Cost Management

This is where AWS gets complicated. Your data platform touches at least four services, each with different pricing models:

Service             | Pricing Model                                     | Rough Cost
--------------------|---------------------------------------------------|--------------------------------------------
Redshift Serverless | RPU-hours consumed                                | $0.375/RPU-hour, billed per second
AWS Glue            | DPU-hours                                         | $0.44/DPU-hour (minimum 2 DPUs per job)
MWAA                | Environment hours + worker hours                  | ~$350/month minimum (smallest environment)
QuickSight          | Per-user/month (authors) + per-session (readers)  | $24/author/month, $0.30/session for readers
S3                  | Storage + requests                                | $0.023/GB/month (Standard)

The RPU-hour pricing for Redshift Serverless deserves a closer look. At a 32 RPU base capacity, you're paying $12/hour whenever the warehouse is active. Redshift Serverless does scale to zero when idle (no compute charge when no queries run), but during active queries, the cost ramps up quickly. Monitor your usage with CloudWatch and set up billing alerts.

Glue DPU costs can also surprise you. A simple job with 2 DPUs running for 10 minutes costs about $0.15 per run. That's fine. But a complex job with 10 DPUs running for 2 hours costs $8.80 -- and if that runs daily, it adds up.
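The arithmetic above is worth keeping as a back-of-the-envelope helper. Rates are the public list prices from the table; check your region's pricing page:

```python
# Back-of-the-envelope cost helpers for the two usage-billed services.
GLUE_DPU_HOUR = 0.44        # $/DPU-hour (assumed list price, us-east-1)
REDSHIFT_RPU_HOUR = 0.375   # $/RPU-hour (assumed list price, us-east-1)

def glue_job_cost(dpus: int, minutes: float) -> float:
    """Cost of one Glue job run, rounded to cents."""
    return round(dpus * (minutes / 60) * GLUE_DPU_HOUR, 2)

def redshift_active_cost(rpus: int, hours: float) -> float:
    """Cost of active Redshift Serverless compute, rounded to cents."""
    return round(rpus * hours * REDSHIFT_RPU_HOUR, 2)

# The examples from the text:
print(glue_job_cost(2, 10))         # small job, 10 min  -> 0.15
print(glue_job_cost(10, 120))       # heavy job, 2 hours -> 8.8
print(redshift_active_cost(32, 1))  # one active hour at 32 RPUs -> 12.0
```

Multiply the heavy-job figure by 30 daily runs and you're at roughly $264/month for a single Glue job, which is why per-run costs deserve a line in your budget.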

S3 as the Staging Layer

S3 plays a central role even though it's not in the "main" stack. Glue reads source files from S3 and writes intermediate results there. Redshift uses S3 for managed storage (RA3/Serverless), COPY commands, and UNLOAD operations. dbt can use S3 as a staging location for large materializations.

Organize your S3 buckets by purpose:

s3://company-data-lake/
  raw/            # Landing zone for source data
    source_a/
    source_b/
  glue/           # Glue job scripts and temp storage
    scripts/
    temp/
  dbt/            # dbt artifacts and logs (optional)
    artifacts/

Common Pitfalls

Redshift DIST and SORT Keys

Redshift performance is heavily influenced by how data is physically distributed and sorted on disk. dbt lets you set distribution and sort keys in model configs:

-- models/marts/fct_orders.sql
{{ config(
    materialized='table',
    dist='customer_id',
    sort=['order_date', 'customer_id'],
    sort_type='compound'
) }}

select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ ref('stg_orders') }}

The wrong distribution key can cause massive data shuffling during joins. The wrong sort key (or no sort key) means Redshift scans entire tables instead of pruning blocks. These choices don't matter much for small datasets, but they make or break performance at scale.

General rules:

  • DIST key: Choose the column you most frequently join on. If fct_orders is always joined to dim_customers on customer_id, distribute both tables by customer_id.
  • SORT key: Choose columns you frequently filter on. Date columns are almost always good sort keys.
  • When in doubt, use DISTSTYLE AUTO and let Redshift figure it out.

Glue Cold Start Times

Glue jobs have a cold start penalty. The first time a job runs (or if it hasn't run in a while), it takes 1-3 minutes just to provision the Spark environment before your actual code executes. This doesn't matter for long-running jobs, but it's frustrating for small, quick jobs.

If cold starts are a problem, consider Glue Python Shell jobs (lighter weight, faster startup) or Glue Ray jobs for Python-native workloads.
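Defining a Python Shell job is the same `create_job` call with a different command type. A sketch with hypothetical job name, role ARN, and script path:

```python
# Sketch: defining a lightweight Glue Python Shell job (faster startup than
# Spark jobs). Name, role ARN, and script location are hypothetical.

def python_shell_job(name: str, role_arn: str, script_s3: str) -> dict:
    """kwargs for glue.create_job() configuring a Python Shell job."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",        # instead of "glueetl" (Spark)
            "ScriptLocation": script_s3,
            "PythonVersion": "3.9",
        },
        "MaxCapacity": 0.0625,            # 1/16 DPU -- the smallest allowed
    }

def register_job() -> None:
    """Requires AWS credentials with Glue permissions."""
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    glue.create_job(**python_shell_job(
        "quick-api-pull",
        "arn:aws:iam::123456789012:role/GlueJobRole",
        "s3://company-data-lake/glue/scripts/quick_api_pull.py",
    ))
```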

QuickSight SPICE Refresh Lag

SPICE caches data in memory for fast dashboard rendering. But that cache needs to be refreshed to pick up new data. If your dbt run finishes at 6:00 AM but the SPICE refresh runs at 6:30 AM, there's a 30-minute window where dashboards show stale data.

The fix is to chain the SPICE refresh directly after the dbt run in your Airflow DAG (as shown in the example above). But QuickSight's refresh API can be slow and occasionally unreliable. Build in retries and monitoring.
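If you need more control than the operator gives you, the underlying calls are `create_ingestion` plus polling `describe_ingestion`. A sketch with capped exponential backoff; the account ID and dataset ID are hypothetical:

```python
# Sketch: trigger a SPICE ingestion and poll until it reaches a terminal
# state. Account ID and dataset ID below are hypothetical placeholders.
import time
import uuid

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}

def poll_delays(attempts: int, base: float = 5.0, cap: float = 60.0) -> list:
    """Capped exponential backoff schedule (seconds) for polling."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def refresh_spice(account_id: str = "123456789012",
                  dataset_id: str = "your-dataset-id") -> None:
    """Requires AWS credentials with QuickSight permissions."""
    import boto3

    qs = boto3.client("quicksight", region_name="us-east-1")
    ingestion_id = str(uuid.uuid4())
    qs.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
    )
    for delay in poll_delays(attempts=10):
        status = qs.describe_ingestion(
            AwsAccountId=account_id,
            DataSetId=dataset_id,
            IngestionId=ingestion_id,
        )["Ingestion"]["IngestionStatus"]
        if status in TERMINAL:
            print(f"Ingestion finished: {status}")
            break
        time.sleep(delay)  # still INITIALIZED/QUEUED/RUNNING -- wait and retry
```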

IAM Permission Complexity

AWS IAM is powerful but verbose. Your dbt runner needs permissions for Redshift, your Glue jobs need permissions for S3 and Redshift, your Airflow workers need permissions for Glue and QuickSight. Each of these is a separate IAM policy, and getting them right requires trial and error.

Start with broader permissions during development (AWS managed policies like AmazonRedshiftFullAccess), then tighten them for production using the principle of least privilege. CloudTrail logs show you exactly which API calls are being made, which helps you craft precise policies.
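As one concrete example of that tightening, here's roughly what the dbt runner's policy can shrink to once you've confirmed its API calls in CloudTrail. The role name and workgroup ARN are hypothetical; the resource scoping is the part worth copying:

```python
# Sketch: a narrow inline policy for the dbt runner role, replacing
# AmazonRedshiftFullAccess. Role name and ARN below are hypothetical.
import json

DBT_RUNNER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TempDbCredentials",
            "Effect": "Allow",
            # The one call dbt's `method: iam` needs against Serverless
            "Action": "redshift-serverless:GetCredentials",
            "Resource": "arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/*",
        }
    ],
}

def attach_policy() -> None:
    """Requires AWS credentials with IAM permissions."""
    import boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName="dbt-runner",
        PolicyName="dbt-redshift-least-privilege",
        PolicyDocument=json.dumps(DBT_RUNNER_POLICY),
    )
```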

The Orchestration Challenge

Airflow is the glue that holds this whole pipeline together. It's also the most operationally demanding component.

If you go with MWAA, the managed service starts at about $350/month for the smallest environment (mw1.small). That's $4,200/year before you factor in worker costs for actual DAG execution. MWAA also has limitations: you can't install arbitrary system packages, Python dependency management can be tricky, and the environment takes 20-30 minutes to create or update.

If you self-host, you're running a distributed system (webserver, scheduler, workers, metadata DB, message broker) that needs monitoring, backups, and regular upgrades. That's a significant ops burden for a team that probably wants to spend its time on data models, not infrastructure.

Either way, you're paying a real cost -- in dollars for MWAA, or in engineering time for self-hosted -- to run what is essentially a fancy cron job with dependency management.

A Simpler Path for the Orchestration Layer

Building a data platform on AWS with Glue, Redshift, dbt, and QuickSight is a solid architecture. Each component is mature, well-documented, and battle-tested. The pain point is the orchestration layer: Airflow is powerful but expensive to run and maintain.

ModelDock handles the Airflow orchestration for your dbt + Redshift pipeline. Connect your Git repo, enter your Redshift credentials (encrypted with AES-256-GCM), set a schedule, and your dbt project runs in an isolated container with full logs and artifact storage. Airflow runs under the hood, but you never have to manage it.

No MWAA bills. No self-hosted Airflow cluster. Just reliable dbt runs on your schedule.

Free during the open beta. Try it at modeldock.run.

Ready to run dbt-core in production?

ModelDock handles scheduling, infrastructure, and credential management so you don't have to.

Start For Free