The Medallion Architecture Explained: Bronze, Silver, Gold with dbt
A practical guide to implementing the medallion architecture (bronze, silver, gold) with dbt — real model examples, naming conventions, and best practices.
If you've spent any time around modern data teams, you've heard the terms bronze, silver, and gold thrown around. Maybe in a Databricks talk. Maybe in a system design doc. Maybe from a colleague who insists every warehouse needs exactly three layers.
The medallion architecture is one of those ideas that sounds like marketing but is actually useful. It gives your data pipeline a clear structure — raw data comes in at the bottom, clean data comes out at the top, and every step in between has a well-defined purpose.
The problem is that most guides on medallion architecture are written through a Databricks lens, tied to Delta Lake and Spark-specific tooling. But the concept works everywhere. If you're using dbt with Snowflake, BigQuery, PostgreSQL, Redshift, or anything else, you can implement this pattern today with nothing more than a good folder structure and some naming conventions.
This guide walks through the medallion architecture layer by layer, with real dbt model examples using an e-commerce dataset. No hand-waving — actual SQL you can adapt to your own project.
What Is the Medallion Architecture?
The medallion architecture is a data design pattern that organizes your pipeline into three layers:
- Bronze — raw data, loaded as-is from source systems
- Silver — cleaned, deduplicated, and properly typed data
- Gold — aggregated, business-ready tables for reporting and analytics
The term was popularized by Databricks as part of their lakehouse architecture, where it describes how data flows through a Delta Lake. But the pattern itself isn't new and isn't exclusive to any platform. Data teams have been building staging-intermediate-mart pipelines for years. The medallion architecture just gives that pattern a memorable name and a clear set of rules.
The core principle is simple: data quality and business value increase as you move from bronze to silver to gold. Each layer has a specific responsibility, and models at one layer should only reference models from the same layer or the one below it. Bronze models read from sources. Silver models read from bronze. Gold models read from silver (and occasionally from other gold models).
This layered approach makes your pipeline easier to debug, easier to test, and easier to explain to stakeholders who want to know where a number came from.
The Three Layers, With Real dbt Examples
Let's use a realistic scenario. You're building an analytics pipeline for an e-commerce company. Your source systems include a PostgreSQL application database with tables for orders, customers, and products, plus a Stripe webhook table for payment events.
Bronze Layer: Raw Data (Staging)
The bronze layer is your landing zone. Data arrives here in its original form — same column names, same data types, same messiness. The goal is to create a thin, reliable interface between your source systems and the rest of your pipeline.
In dbt, this maps to models/staging/. Bronze models use the source() function to reference raw tables, and they do only the bare minimum: renaming columns for consistency, casting timestamps, and maybe filtering out obviously invalid records like test rows.
First, define your sources in a YAML file:
```yaml
# models/staging/ecommerce/_ecommerce__sources.yml
version: 2

sources:
  - name: ecommerce
    database: raw
    schema: public
    freshness:
      warn_after: { count: 12, period: hour }
      error_after: { count: 24, period: hour }
    loaded_at_field: _loaded_at
    tables:
      - name: orders
      - name: customers
      - name: products
      - name: payment_events
        description: "Stripe webhook events for payments"
```
Then a staging model for orders:
```sql
-- models/staging/ecommerce/stg_ecommerce__orders.sql
with source as (
    select * from {{ source('ecommerce', 'orders') }}
),

renamed as (
    select
        id as order_id,
        user_id as customer_id,
        status as order_status,
        total_amount_cents,
        currency,
        shipping_address_id,
        cast(created_at as timestamp) as ordered_at,
        cast(updated_at as timestamp) as updated_at
    from source
)

select * from renamed
```
And one for customers:
```sql
-- models/staging/ecommerce/stg_ecommerce__customers.sql
with source as (
    select * from {{ source('ecommerce', 'customers') }}
),

renamed as (
    select
        id as customer_id,
        lower(trim(email)) as email,
        first_name,
        last_name,
        country_code,
        cast(created_at as timestamp) as created_at
    from source
)

select * from renamed
```
Notice what's happening here: no joins, no aggregations, no business logic. Just renaming columns to match your project's conventions, normalizing emails, and casting types. If someone asks "what does the raw data look like?", the bronze layer is the answer.
Materialization: Views work well for staging models. They add no storage cost and always reflect the latest source data. In your dbt_project.yml:
```yaml
models:
  your_project:
    staging:
      +materialized: view
```
Silver Layer: Cleaned Data (Intermediate)
The silver layer is where the real work begins. Models here clean up data quality issues, deduplicate records, apply business rules, and join related entities together. This is the backbone of your pipeline — not yet shaped for any specific report, but trustworthy enough that downstream consumers can rely on it.
In dbt, this maps to models/intermediate/. Silver models reference bronze (staging) models and never touch raw sources directly.
Here's an intermediate model that builds a clean order history with payment status:
```sql
-- models/intermediate/int_orders__enriched.sql
with orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
),

payments as (
    select * from {{ ref('stg_ecommerce__payment_events') }}
),

latest_payment as (
    select
        order_id,
        payment_status,
        payment_method,
        row_number() over (
            partition by order_id
            order by event_timestamp desc
        ) as row_num
    from payments
),

joined as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_status,
        orders.total_amount_cents / 100.0 as order_total,
        orders.currency,
        latest_payment.payment_status,
        latest_payment.payment_method,
        orders.ordered_at,
        orders.updated_at
    from orders
    left join latest_payment
        on orders.order_id = latest_payment.order_id
        and latest_payment.row_num = 1
)

select * from joined
```
And an intermediate customer model that handles deduplication:
```sql
-- models/intermediate/int_customers__deduplicated.sql
with customers as (
    select * from {{ ref('stg_ecommerce__customers') }}
),

deduplicated as (
    select
        customer_id,
        email,
        first_name,
        last_name,
        country_code,
        created_at,
        row_number() over (
            partition by email
            order by created_at asc
        ) as row_num
    from customers
),

final as (
    select
        customer_id,
        email,
        first_name,
        last_name,
        country_code,
        created_at
    from deduplicated
    where row_num = 1
)

select * from final
```
This is where you handle the messy reality of production data. Duplicate customer accounts with the same email? Resolved here. Payment events that need to be collapsed to the latest status? Done here. Currency conversion from cents to dollars? Here.
The silver layer is also the natural place for data that multiple gold models will need. If three different dashboards need enriched order data, you build it once in silver and reference it from gold — not three times in three separate mart models.
Materialization: Intermediate models benefit from being materialized as tables or incremental models, especially if they involve expensive joins or window functions. For large datasets, incremental is the right call:
```yaml
models:
  your_project:
    intermediate:
      +materialized: table
```
Gold Layer: Business-Ready Data (Marts)
The gold layer is what your stakeholders actually see. These are the tables that power dashboards, feed into reverse ETL tools, and get queried by analysts writing ad-hoc reports. They're shaped for specific business questions and optimized for consumption.
In dbt, this maps to models/marts/. Gold models use the fct_ (fact) and dim_ (dimension) prefixes to signal what kind of table they are.
A fact table for order metrics:
```sql
-- models/marts/fct_orders.sql
with orders as (
    select * from {{ ref('int_orders__enriched') }}
),

final as (
    select
        order_id,
        customer_id,
        order_status,
        payment_status,
        payment_method,
        order_total,
        currency,
        ordered_at,
        date_trunc('month', ordered_at) as order_month,
        case
            when payment_status = 'succeeded' then 'paid'
            when payment_status = 'refunded' then 'refunded'
            when order_status = 'cancelled' then 'cancelled'
            else 'pending'
        end as order_state
    from orders
)

select * from final
```
A dimension table for customers:
```sql
-- models/marts/dim_customers.sql
with customers as (
    select * from {{ ref('int_customers__deduplicated') }}
),

orders as (
    select * from {{ ref('int_orders__enriched') }}
),

customer_orders as (
    select
        customer_id,
        count(*) as total_orders,
        sum(order_total) as lifetime_value,
        min(ordered_at) as first_order_at,
        max(ordered_at) as most_recent_order_at
    from orders
    group by customer_id
),

final as (
    select
        customers.customer_id,
        customers.email,
        customers.first_name,
        customers.last_name,
        customers.country_code,
        customers.created_at as customer_since,
        coalesce(customer_orders.total_orders, 0) as total_orders,
        coalesce(customer_orders.lifetime_value, 0) as lifetime_value,
        customer_orders.first_order_at,
        customer_orders.most_recent_order_at
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
)

select * from final
```
Materialization: Gold models should almost always be tables. They're queried frequently and need to be fast. If your fact tables are large and append-heavy, incremental materialization is the way to go:
```yaml
models:
  your_project:
    marts:
      +materialized: table
```
dbt Project Structure for Medallion Architecture
Here's the full folder layout that maps the medallion layers to dbt conventions:
```
models/
  staging/
    ecommerce/
      _ecommerce__sources.yml
      _ecommerce__models.yml
      stg_ecommerce__orders.sql
      stg_ecommerce__customers.sql
      stg_ecommerce__products.sql
      stg_ecommerce__payment_events.sql
    stripe/
      _stripe__sources.yml
      stg_stripe__charges.sql
  intermediate/
    _int__models.yml
    int_orders__enriched.sql
    int_customers__deduplicated.sql
    int_products__categorized.sql
  marts/
    _marts__models.yml
    fct_orders.sql
    fct_daily_revenue.sql
    dim_customers.sql
    dim_products.sql
```
A few things to notice:
- Staging models are grouped by source system, not by business domain. This keeps it clear which raw tables each model depends on.
- Intermediate and mart models are grouped by business domain (or kept flat if your project is small enough). As your project grows, you might add subfolders like marts/finance/ and marts/marketing/.
- YAML schema files live next to the models they describe, prefixed with an underscore so they sort to the top of the directory.
Naming Conventions
Consistent naming is what makes a large dbt project navigable. Here are the prefixes most dbt projects follow; they map cleanly onto the medallion layers:
| Layer | Prefix | Example | Purpose |
|---|---|---|---|
| Bronze | stg_ | stg_ecommerce__orders | 1:1 with source tables |
| Silver | int_ | int_orders__enriched | Cleaned, joined, intermediate |
| Gold | fct_ | fct_orders | Fact tables (events, transactions) |
| Gold | dim_ | dim_customers | Dimension tables (entities, attributes) |
The double underscore (__) separates the source or entity name from the descriptor. stg_ecommerce__orders tells you it's a staging model from the ecommerce source for the orders table. int_orders__enriched tells you it's an intermediate model that enriches order data.
This isn't just cosmetics. When you have 200 models, being able to glance at a name and immediately know which layer it belongs to saves real time.
How Layers Map to dbt Materializations
Choosing the right materialization for each layer has a meaningful impact on cost and performance:
Bronze (staging) -- Views. Staging models are simple renames and casts. Materializing them as views means they take up no warehouse storage and always reflect the freshest source data. The compute cost is negligible since the transformations are trivial.
Silver (intermediate) -- Tables or incremental. Intermediate models often involve joins, window functions, and deduplication logic. Materializing them as tables avoids recomputing expensive transformations every time a downstream model runs. For very large tables (hundreds of millions of rows), switch to incremental to avoid full rebuilds on every run.
Gold (marts) -- Tables. Mart models are what analysts and BI tools query directly. They need to be fast. Tables are the right default. For append-heavy fact tables, incremental models keep run times manageable as data grows.
Here's the full configuration in dbt_project.yml:
```yaml
models:
  your_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: table
```
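Folder-level defaults cover most cases, but the incremental option mentioned above is opted into per model. A hedged sketch of what an incremental variant of the orders fact table might look like — the unique_key and the updated_at filter column are assumptions based on the models shown earlier, not a prescribed setup:

```sql
-- models/marts/fct_orders.sql (incremental variant, illustrative)
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select * from {{ ref('int_orders__enriched') }}

{% if is_incremental() %}
-- On incremental runs, only process rows newer than what's already in the table
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The first run builds the full table; subsequent runs only scan and merge new rows, which is what keeps run times flat as an append-heavy fact table grows.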
Testing Strategy Per Layer
Each layer has different testing priorities. Spreading your tests across layers keeps the overall test suite fast while still catching the issues that matter.
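One practical consequence of the layered layout is that you can build and test each layer independently with dbt's node selection syntax. A sketch, assuming the folder structure shown earlier:

```shell
# Build and test only the bronze (staging) layer
dbt build --select path:models/staging

# Build the silver layer and everything downstream of it
dbt build --select path:models/intermediate+

# Check source freshness before the main run
dbt source freshness
```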
Bronze: Source Freshness
The most important thing to test at the bronze layer is whether data is actually arriving. Use dbt's source freshness checks:
```shell
dbt source freshness
```
This checks the freshness configuration in your source YAML (which we defined earlier). If orders haven't been loaded in 24 hours, something is wrong upstream and you want to know about it before your pipeline runs.
You should also add basic schema tests on staging models:
```yaml
# models/staging/ecommerce/_ecommerce__models.yml
version: 2

models:
  - name: stg_ecommerce__orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
```
Silver: Data Quality
The silver layer is where you enforce data quality. After deduplication and cleaning, your tests should verify the results:
```yaml
# models/intermediate/_int__models.yml
version: 2

models:
  - name: int_customers__deduplicated
    columns:
      - name: email
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: int_orders__enriched
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
```
Gold: Business Logic
Gold layer tests validate business rules. These are the tests that catch logic errors that would show up in dashboards:
```yaml
# models/marts/_marts__models.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_state
        tests:
          - accepted_values:
              values: ['paid', 'refunded', 'cancelled', 'pending']
  - name: dim_customers
    columns:
      - name: total_orders
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
      - name: lifetime_value
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
```
You can also add custom singular tests for more nuanced business logic -- for example, verifying that daily revenue in fct_daily_revenue doesn't deviate more than 50% from the trailing 7-day average, which might indicate a data loading issue.
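A hedged sketch of such a singular test — the revenue_date and total_revenue column names are illustrative assumptions, since fct_daily_revenue isn't defined above. A singular test fails if its query returns any rows:

```sql
-- tests/assert_daily_revenue_within_range.sql
-- Fails when a day's revenue deviates more than 50% from the
-- trailing 7-day average (column names are illustrative)
with daily as (
    select
        revenue_date,
        total_revenue,
        avg(total_revenue) over (
            order by revenue_date
            rows between 7 preceding and 1 preceding
        ) as trailing_7d_avg
    from {{ ref('fct_daily_revenue') }}
)

select *
from daily
where trailing_7d_avg is not null
  and abs(total_revenue - trailing_7d_avg) > 0.5 * trailing_7d_avg
```

Any rows returned represent days that violate the rule, and dbt reports them as test failures.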
When NOT to Use Medallion Architecture
The medallion architecture is not always the right answer. Here's when you should skip it:
Your project has fewer than 15-20 models. If your entire pipeline is five staging models and three marts, adding an intermediate layer creates overhead with little benefit. Go straight from staging to marts. You can always add the silver layer later when complexity demands it.
Your data sources are already clean. If you're pulling from a well-maintained SaaS API that delivers consistent, typed, deduplicated data, the silver layer might be doing nothing useful. Not every source needs three layers of processing.
You're building a one-off analysis. If you're answering a specific business question and the pipeline won't be maintained long-term, skip the ceremony. A single model that reads from sources and produces a result is fine.
Your team is small and moves fast. Architecture patterns have a coordination cost. If it's just you and one other analyst, the overhead of maintaining strict layer boundaries might slow you down more than it helps. Use the conventions that feel natural and formalize them when the team grows.
The medallion pattern shines when you have multiple source systems, a growing model count, and several people working in the same dbt project. It gives everyone a shared mental model for where things live and how data flows. But like any architecture decision, it should be driven by actual need rather than theoretical purity.
Wrapping Up
The medallion architecture is a straightforward way to organize your dbt project as it scales. Bronze gives you a clean interface to your source systems. Silver handles the messy transformation work. Gold delivers business-ready tables that analysts and dashboards can trust.
The beauty of implementing this in dbt is that you don't need any special tooling. No Delta Lake, no Spark, no proprietary platform features. Just folders, naming conventions, and the ref function doing what it does best -- building a clean dependency graph from raw data to business insight.
Start with staging and marts. Add intermediate when you find yourself duplicating transformation logic across multiple mart models. Let the architecture grow with your project rather than trying to build the perfect structure on day one.
ModelDock handles dbt scheduling so you can focus on your data architecture. Connect your Git repo, configure your warehouse credentials, and let ModelDock run your dbt builds on Airflow -- no infrastructure to manage. Try it free.