ETL Pipeline Guide: Building Data Infrastructure That Scales With Your Business

Your business generates data in dozens of systems simultaneously. Making that data useful for analysis, reporting, and machine learning requires moving it, transforming it, and loading it reliably into a single place. That is what ETL pipelines do — and the difference between doing it well and doing it poorly compounds every month.

Why ETL Pipeline Architecture Matters

Data engineering is the unglamorous foundation beneath every data-driven business capability. Business intelligence dashboards, predictive models, customer data platforms, and AI features all depend on a reliable, fresh, and accurate supply of data. When the pipelines feeding those systems are brittle — failing silently, delivering stale data, or producing inconsistent results — every downstream system that depends on them degrades.

The consequences of poor ETL architecture are often invisible until they are catastrophic. Marketing teams make budget decisions on yesterday's data without knowing it is stale. Finance teams reconcile reports that do not match because two dashboards use different pipeline definitions of the same metric. ML models trained on corrupted features produce confident but wrong predictions. These failure modes are real and common.

80% of data science project time is spent on data preparation and pipeline work rather than modelling — a figure that has remained stubbornly consistent for a decade, highlighting how central data engineering quality is to the entire analytics value chain (IBM Data Science Survey, 2025).

ETL vs ELT: The Modern Default

Traditional ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse. This approach made sense when warehouse storage and compute were expensive, so only cleaned, modelled data was worth loading.

Modern ELT (Extract, Load, Transform) inverts the last two steps: raw data is loaded into the data warehouse first, then transformed using the warehouse's own compute engine. This is now the dominant pattern for three reasons. First, cloud data warehouses (BigQuery, Snowflake, Redshift) have made analytical compute dramatically cheaper. Second, raw data is preserved, so it can be reprocessed when transformation logic changes without re-extracting from source. Third, transformation code (SQL running in dbt) is version-controlled, testable, and auditable in a way that traditional ETL transformation logic often is not.
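The ordering described above can be sketched in a few lines. This is a minimal illustration only: Python's built-in sqlite3 module stands in for a cloud warehouse, and the table names (raw_orders, orders_clean) and sample rows are hypothetical.

```python
import sqlite3

# sqlite3 stands in for the warehouse; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")

# Extract: rows as they might arrive from a source system, untouched.
source_rows = [
    ("1001", " 19.99 ", "PAID"),
    ("1002", "5.00", "refunded"),
    ("1003", "12.50", "paid"),
]

# Load: land the raw data as-is, with every column stored as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: runs inside the warehouse, after loading, as plain SQL.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           LOWER(status)              AS status
    FROM raw_orders
""")

clean = conn.execute("SELECT * FROM orders_clean ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because raw_orders is left untouched, a change to the transformation logic only requires rebuilding orders_clean, not re-extracting from the source.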

The Modern ELT Stack Architecture

Layer 1: Extraction and Ingestion

The ingestion layer moves data from source systems — your production database, SaaS tools, third-party APIs, event streams, and files — into the raw zone of your data warehouse. This layer must handle the diversity of source protocols, manage authentication and rate limits, detect incremental changes efficiently, and deliver data reliably without duplicates or gaps.
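As a rough sketch of incremental change detection without duplicates, the example below keeps a high-water mark per source and loads rows with an idempotent write keyed on the primary key. It assumes the source exposes an updated_at timestamp in ISO 8601 text form; sqlite3 stands in for the raw zone, and every table, column, and function name here is hypothetical.

```python
import sqlite3

# sqlite3 stands in for the warehouse raw zone; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")
conn.execute("CREATE TABLE sync_state (source TEXT PRIMARY KEY, high_water_mark TEXT)")


def sync(source_rows):
    """Load rows changed since the stored high-water mark; return how many were loaded."""
    state = conn.execute(
        "SELECT high_water_mark FROM sync_state WHERE source = 'customers'"
    ).fetchone()
    since = state[0] if state else ""

    # Stand-in for a source query such as: WHERE updated_at > :since
    changed = [r for r in source_rows if r["updated_at"] > since]

    for r in changed:
        # Keyed on the primary key, so a retried batch overwrites instead of duplicating.
        conn.execute(
            "INSERT OR REPLACE INTO raw_customers VALUES (:id, :email, :updated_at)", r
        )
    if changed:
        newest = max(r["updated_at"] for r in changed)
        conn.execute(
            "INSERT OR REPLACE INTO sync_state VALUES ('customers', ?)", (newest,)
        )
    conn.commit()  # rows and high-water mark are committed together
    return len(changed)
```

Running sync twice on the same source data loads nothing the second time. One caveat of this simple design: a strict "greater than" comparison can miss a row that shares the exact timestamp of the stored mark, which is why production connectors often re-read a small overlap window and rely on the idempotent write to absorb it.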

Building and maintaining source connectors from scratch is expensive engineering work. Most organisations are better served by managed connector platforms that handle the connector engineering and maintenance burden.

Managed Connectors: Fivetran vs Airbyte

Fivetran is the market leader for managed, fully hosted data connectors. It offers 300+ pre-built connectors for SaaS tools, databases, and APIs, with automatic schema migration, incremental syncing, and comprehensive monitoring. Fivetran's pricing is based on monthly active rows (MAR) — the number of distinct rows synced per month. It is the lowest-friction option for standard SaaS-to-warehouse data movement.
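To make the monthly active rows idea concrete, here is a small sketch that follows the definition given above: a row counts once per month no matter how many times it is synced. The sync log format and the function name are hypothetical, and actual billing rules may differ in detail.

```python
from collections import defaultdict

# Hypothetical sync log: (month, table, primary_key) for every row write.
sync_log = [
    ("2026-01", "orders", 1),
    ("2026-01", "orders", 1),  # same row updated again: still one active row
    ("2026-01", "orders", 2),
    ("2026-01", "users", 1),   # same key in a different table is a different row
    ("2026-02", "orders", 1),
]


def monthly_active_rows(log):
    """Count distinct (table, primary_key) pairs synced in each month."""
    seen = defaultdict(set)
    for month, table, pk in log:
        seen[month].add((table, pk))
    return {month: len(rows) for month, rows in seen.items()}


mar = monthly_active_rows(sync_log)
```

Under this model, a frequently updated table costs no more than a table whose rows each change once a month; what matters is how many distinct rows change.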

Airbyte is the open-source alternative, available both self-hosted (free) and as a cloud service. It offers a comparable connector library with a more permissive pricing model. Self-hosted Airbyte requires infrastructure management and operational overhead that Fivetran eliminates. Cloud Airbyte offers a competitive alternative to Fivetran at mid-scale. For cost-sensitive organisations with engineering capacity to manage infrastructure, Airbyte self-hosted is compelling.

For sources without managed connectors — proprietary internal systems, custom APIs, specialised data sources — custom ingestion code is unavoidable. Build these using a lightweight framework (Singer.io taps, or the source connector SDK for either Fivetran or Airbyte) rather than bespoke scripts, so the connector can be maintained and monitored consistently alongside managed connectors.

Layer 2: The Data Warehouse

The data warehouse is the central storage layer for all analytical data. It receives raw data from the ingestion layer, stores it in the raw (bronze) zone, and provides the compute engine for transformations. In 2026, four options come up most often:

BigQuery (Google Cloud): Serverless, pay-per-query pricing, excellent for organisations already in GCP. Strong ML integration with Vertex AI. Scales to petabytes without infrastructure management.

Snowflake: Multi-cloud (AWS, GCP, Azure), strong data sharing capabilities, compute-storage separation allows independent scaling. The preferred choice for organisations with multi-cloud or data sharing requirements.

Redshift (AWS): Strong integration with the AWS ecosystem, competitive at high-volume workloads. Serverless Redshift reduces operational overhead significantly versus the older provisioned cluster model.

DuckDB: An emerging in-process analytical database that excels for smaller-scale analytics without cloud infrastructure overhead. Increasingly used alongside cloud warehouses for local development and small-dataset analytics.

$65 billion — projected cloud data warehouse market by 2028, with Snowflake and BigQuery capturing the majority of new workloads. The shift from on-premise analytical databases to cloud warehouses is essentially complete for organisations founded after 2018 (IDC, 2025).

Layer 3: Transformation with dbt

dbt (data build tool) has become the standard transformation layer in modern ELT stacks. It runs SQL transformations as version-controlled code, generates documentation and data lineage diagrams automatically, and provides a testing framework for data quality assertions. The dbt ecosystem has matured significantly — dbt Cloud offers scheduling, alerting, and a metadata-rich IDE. dbt Core remains available as a free open-source tool for teams that prefer self-hosted execution.

dbt projects typically follow a layered structure that maps onto the medallion architecture: staging models (bronze to silver) clean and standardise raw source data, intermediate models join and reshape staging data, and mart models (gold) compute the final business metrics that BI tools consume. Each layer is independently testable and documented.
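The layering can be illustrated without dbt itself. In the sketch below, each layer is a SQL view built on the one beneath it, with sqlite3 standing in for the warehouse. In a real dbt project each SELECT would live in its own model file; the table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Bronze: raw data exactly as ingested (hypothetical table and rows).
    CREATE TABLE raw_payments (id TEXT, customer TEXT, amount_cents TEXT, status TEXT);
    INSERT INTO raw_payments VALUES
        ('1', ' Acme ', '1500', 'SUCCESS'),
        ('2', 'acme',   '2500', 'success'),
        ('3', 'Globex', '900',  'failed');

    -- Silver (staging): cast types and standardise values, one row per source row.
    CREATE VIEW stg_payments AS
    SELECT CAST(id AS INTEGER)                   AS payment_id,
           LOWER(TRIM(customer))                 AS customer,
           CAST(amount_cents AS INTEGER) / 100.0 AS amount,
           LOWER(status)                         AS status
    FROM raw_payments;

    -- Gold (mart): the business metric a BI tool would read.
    CREATE VIEW mart_revenue_by_customer AS
    SELECT customer, SUM(amount) AS revenue
    FROM stg_payments
    WHERE status = 'success'
    GROUP BY customer;
""")

revenue = conn.execute(
    "SELECT customer, revenue FROM mart_revenue_by_customer ORDER BY customer"
).fetchall()
```

The mart never reads raw_payments directly, so cleaning rules live in exactly one place and every downstream metric inherits them.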

dbt tests are the primary mechanism for data quality assurance in the transformation layer. Built-in tests check for nulls, unique values, referential integrity between tables, and accepted value sets. Custom tests validate business logic — revenue is never negative, conversion rates are between 0 and 1, user counts do not exceed total registered users. Run these tests on every dbt job and alert on failures before downstream consumers are affected.
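A useful way to think about these tests is that each one is a query selecting the rows that violate a rule, and the test passes when the query returns nothing. The sketch below imitates that idea in plain Python and SQL rather than using dbt; the check names echo the built-in test types, and the tables and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT, revenue REAL);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES
        (10, 1, 'paid',     20.0),
        (11, 2, 'refunded',  5.0),
        (11, 9, 'unknown',  -3.0);  -- deliberately bad row
""")

# Each check selects the rows that violate the rule; zero rows means a pass.
checks = {
    "not_null_order_id": "SELECT * FROM orders WHERE order_id IS NULL",
    "unique_order_id": (
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
    ),
    "relationships_customer": (
        "SELECT o.order_id FROM orders o "
        "LEFT JOIN customers c ON o.customer_id = c.customer_id "
        "WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL"
    ),
    "accepted_values_status": (
        "SELECT * FROM orders WHERE status NOT IN ('paid', 'refunded')"
    ),
    "revenue_not_negative": "SELECT * FROM orders WHERE revenue < 0",  # custom rule
}


def run_checks(conn, checks):
    """Return the number of violating rows found by each check."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in checks.items()}


failures = {name: n for name, n in run_checks(conn, checks).items() if n > 0}
```

Here the single bad row trips four of the five checks, which is the signal a job would use to alert and stop before downstream consumers read the table.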

Layer 4: Orchestration

Orchestration coordinates the sequence of pipeline steps: run ingestion first, then run dbt transformations after ingestion completes, then refresh BI dashboards after transformations complete. Without orchestration, pipeline steps run on independent schedules with no dependency awareness, so transformations run on stale data and dashboards refresh before data is ready.
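The core of dependency-aware scheduling is a topological ordering of the steps. The sketch below uses Python's standard graphlib module to show the idea; it is not how any particular orchestrator is implemented, and the step names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each key lists the steps that must finish before it runs.
dependencies = {
    "ingest_crm": set(),
    "ingest_billing": set(),
    "dbt_run": {"ingest_crm", "ingest_billing"},
    "dbt_test": {"dbt_run"},
    "refresh_dashboards": {"dbt_test"},
}


def run_pipeline(dependencies, tasks):
    """Run tasks in dependency order, stopping at the first failure.

    tasks maps each step name to a callable returning True on success.
    Returns (completed_steps, failed_step_or_None).
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        if not tasks[name]():
            return completed, name
        completed.append(name)
    return completed, None
```

If dbt_test fails, refresh_dashboards never runs, so dashboards keep showing the last validated data instead of unverified results. Real orchestrators refine this by skipping only the steps downstream of a failure and by running independent steps in parallel.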

Apache Airflow is the most widely used orchestration platform for data pipelines. Its Python-based DAG (Directed Acyclic Graph) definition gives data engineers full programmatic control over pipeline logic, branching, and error handling. Managed Airflow is available through Astronomer (Astro), Google Cloud Composer, and Amazon MWAA, reducing the operational burden of self-hosting.

Dagster is an increasingly popular alternative to Airflow, designed specifically for data pipelines, with a stronger asset-oriented mental model, better testing ergonomics, and a more accessible UI for non-engineers. It is a strong option for teams starting fresh in 2026.

Prefect offers a simpler Python API than Airflow with cloud-based execution. It is well suited to teams with less data engineering specialisation who need reliable scheduling without Airflow's operational complexity.

Change Data Capture: Near-Real-Time Data Movement

Batch pipelines that run hourly or daily introduce data latency — your data warehouse is always some hours behind your production systems. For most analytical use cases, this is acceptable. For operational analytics — real-time dashboards, fraud detection, immediate customer lifecycle triggers — this latency is a product problem.

Change Data Capture (CDC) reads the database transaction log (the binary log in MySQL, the WAL in PostgreSQL) to capture row-level changes — inserts, updates, and deletes — as they occur, and streams them to the data warehouse in near-real-time. Debezium is the open-source CDC standard, integrating with Kafka for event streaming. Fivetran and Airbyte both offer CDC connectors for major databases.

CDC pipelines require careful operational management: log retention on source databases must be configured, position tracking must be maintained, and schema changes must be propagated without data loss. For organisations that need it, CDC is worth the complexity. For most analytical use cases, a well-tuned batch pipeline with 15–60 minute frequency provides adequate freshness at much lower operational cost.
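To show what applying a change stream involves, here is a minimal sketch that replays insert, update, and delete events onto a replica and tracks the last applied log position. The event shape is a simplified, hypothetical one and is not the actual format emitted by Debezium or any other CDC tool.

```python
def apply_change_events(table, events, last_position=0):
    """Apply row-level change events to a replica dict keyed by primary key.

    Returns the position of the last event applied, which the caller
    should persist so that processing can resume after a restart.
    """
    for event in events:
        if event["position"] <= last_position:
            continue  # already applied; makes redelivery after a restart safe
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)
        last_position = event["position"]
    return last_position


# Hypothetical events, in transaction-log order.
events = [
    {"position": 1, "op": "insert", "key": 1, "row": {"email": "a@x.io"}},
    {"position": 2, "op": "update", "key": 1, "row": {"email": "b@x.io"}},
    {"position": 3, "op": "insert", "key": 2, "row": {"email": "c@x.io"}},
    {"position": 4, "op": "delete", "key": 2, "row": None},
]

replica = {}
position = apply_change_events(replica, events)
```

Persisting the returned position is the "position tracking" mentioned above: if it is lost, the consumer must either replay from the start of the retained log or take a fresh snapshot of the source table.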

Data Quality: The Non-Negotiable Foundation

A data pipeline that delivers incorrect data is worse than no pipeline at all — it creates confident wrong answers rather than admitted uncertainty. Data quality must be designed into every layer of the pipeline, not bolted on as an afterthought.

At the ingestion layer: validate that expected schemas match actual schemas, alert on missing source data, and track record counts to detect unexpected drops in volume. At the transformation layer: run dbt tests on every model, assert that business logic produces valid results, and check that metrics are within expected ranges. At the serving layer: monitor data freshness in BI tools, compare dashboard metrics against source-of-truth queries periodically, and establish a clear process for reporting and resolving data quality incidents.
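Two of the ingestion-layer checks above, schema comparison and volume monitoring, are simple enough to sketch directly. The function names, the column-to-type mapping format, and the 50% drop threshold are assumptions chosen for illustration.

```python
def schema_drift(expected, actual):
    """Compare expected and actual column-to-type mappings."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != actual[c]),
    }


def volume_dropped(todays_count, recent_counts, max_drop=0.5):
    """True when today's row count falls more than max_drop below the recent average."""
    if not recent_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent_counts) / len(recent_counts)
    return todays_count < baseline * (1 - max_drop)


drift = schema_drift(
    expected={"id": "int", "email": "text", "plan": "text"},
    actual={"id": "int", "email": "varchar", "signup_source": "text"},
)
```

A fixed threshold is the simplest possible rule. Sources with strong weekly seasonality usually need a baseline built from the same weekday rather than a plain recent average.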

$12.9 million — average annual cost of poor data quality to organisations, including wasted analyst time investigating anomalies, incorrect business decisions made on bad data, and the engineering cost of remediating data quality incidents (Gartner, 2025).

Scaling Your Pipeline Infrastructure

Pipeline architecture must scale with your data volume without requiring redesign at each order-of-magnitude increase. Partitioned tables in your data warehouse — splitting large tables by date or tenant — are the most important scaling mechanism. A query on a table partitioned by date only scans the relevant partitions, reducing compute cost and latency dramatically as table size grows.
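The effect of partition pruning can be seen in a toy model. The class below stores rows in one bucket per date and reports how many rows a query had to touch; it is a conceptual sketch with hypothetical names, not a description of how any warehouse implements partitioning internally.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy table that stores rows in one bucket per event date."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, event_date, row):
        self.partitions[event_date].append(row)

    def query(self, start, end):
        """Return rows with start <= date <= end, plus the number of rows scanned."""
        rows, scanned = [], 0
        for day, bucket in self.partitions.items():
            if start <= day <= end:  # partitions outside the range are never read
                scanned += len(bucket)
                rows.extend(bucket)
        return rows, scanned


table = PartitionedTable()
for day in range(1, 31):
    for n in range(100):
        table.insert(date(2026, 1, day), {"n": n})

rows, scanned = table.query(date(2026, 1, 15), date(2026, 1, 15))
total_rows = sum(len(bucket) for bucket in table.partitions.values())
```

A single-day query scans one partition's rows rather than the whole table, and that ratio keeps improving as more days of history accumulate.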

Incremental dbt models, which process only new or changed records on each run rather than reprocessing the entire table, are essential for keeping transformation runs fast as data volume grows. For illustration, a full refresh of a 100-million-row table might take 45 minutes, while an incremental run that processes only the last hour's changes might take 90 seconds.
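The incremental pattern comes down to a filter that compares incoming rows with the newest row already in the target table. The sketch below expresses that filter in plain SQL, with sqlite3 standing in for the warehouse; in dbt the same idea is written inside an incremental model. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (event_id INTEGER PRIMARY KEY, user_id INTEGER, loaded_at TEXT);
    CREATE TABLE fct_events (event_id INTEGER PRIMARY KEY, user_id INTEGER, loaded_at TEXT);
""")


def incremental_run(conn):
    """Process only raw rows newer than the newest row already in the target."""
    cur = conn.execute("""
        INSERT OR REPLACE INTO fct_events
        SELECT event_id, user_id, loaded_at
        FROM raw_events
        WHERE loaded_at > (SELECT COALESCE(MAX(loaded_at), '') FROM fct_events)
    """)
    conn.commit()
    return cur.rowcount  # number of rows written by this run
```

On the first run the target is empty, so every row qualifies; later runs touch only what arrived since. The trade-off is that rows arriving late with an older loaded_at are skipped, which is why incremental models are usually paired with a periodic full refresh or a lookback window.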

Frequently Asked Questions

What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading it into the destination. ELT (Extract, Load, Transform) loads raw data into the destination first, then transforms it there. ELT is now the dominant pattern for cloud data warehouses because warehouse compute is cheaper than custom transformation infrastructure, and raw data is preserved for reprocessing.

Should I build my own ETL pipeline or use a managed tool like Fivetran?

Use a managed connector tool (Fivetran, Airbyte) for standard SaaS-to-warehouse data movement. These tools handle connector maintenance, schema drift detection, and incremental syncing — problems that are expensive to build and maintain yourself. Build custom pipelines only for sources without managed connectors, for proprietary data, or when data volume makes managed tool pricing prohibitive.

How do I handle schema changes in ETL pipelines?

Design pipelines to be schema-tolerant: detect new columns automatically and add them to destination tables rather than failing. Use schema registries for event-based pipelines. Version your transformation models. Build monitoring that alerts on new columns, changed data types, or missing expected fields. Schema drift handled proactively is an inconvenience; handled reactively it is an outage.
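A schema-tolerant load step might look something like the sketch below, which adds any column the destination lacks before inserting and reports what it added so monitoring can alert on it. sqlite3 stands in for the warehouse, new columns default to TEXT for simplicity, and the function and table names are hypothetical. Identifiers are assumed to come from a trusted source, since they are interpolated into the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")


def load_tolerant(conn, table, record):
    """Insert a record, adding any columns the destination table lacks.

    Returns the list of columns that had to be added.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = [col for col in record if col not in existing]
    for col in added:
        conn.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" TEXT')

    columns = ", ".join(f'"{c}"' for c in record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
        list(record.values()),
    )
    conn.commit()
    return added
```

Rows loaded before the new column existed simply hold NULL for it, so historical data stays queryable while the returned list gives the alerting hook described above.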

What is data freshness and how do I improve it?

Data freshness is the lag between when data is created in a source system and when it is available for analysis. Batch pipelines typically deliver 1–24 hour freshness. Change Data Capture (CDC) pipelines deliver near-real-time freshness (seconds to minutes). Improve freshness by increasing sync frequency for critical sources, using CDC for high-priority tables, and monitoring data age with automated alerts.
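Monitoring data age reduces to comparing the newest load timestamp with the current time. A minimal sketch, assuming each table records when its latest row was loaded; the function names and the one-hour threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(latest_loaded_at, now):
    """Time elapsed since the newest row was loaded."""
    return now - latest_loaded_at


def is_stale(latest_loaded_at, now, max_lag):
    """True when the data is older than the agreed freshness target."""
    return freshness_lag(latest_loaded_at, now) > max_lag


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
latest = datetime(2026, 1, 1, 10, 30, tzinfo=timezone.utc)

lag = freshness_lag(latest, now)
alert = is_stale(latest, now, max_lag=timedelta(hours=1))
```

Using timezone-aware timestamps throughout avoids false alerts when the source system and the warehouse run in different time zones.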

How much does a production ETL pipeline cost to run?

Managed connector tools (Fivetran) cost $500–$5,000/month depending on data volume and connector count. Open-source alternatives (Airbyte self-hosted) have lower tooling costs but require engineering time to maintain. Data warehouse compute costs vary — a typical mid-market data stack costs $2,000–$8,000/month in total infrastructure and tooling.

What is the medallion architecture in data engineering?

The medallion architecture organises data warehouse tables into three quality tiers: Bronze (raw ingested data as it arrived from source), Silver (cleaned, standardised, and validated data), and Gold (aggregated, business-logic-applied data ready for BI tools). This layered approach makes data lineage clear, enables reprocessing from raw data, and separates concerns cleanly between data engineering and analytics engineering.

Need a Data Pipeline That Actually Scales?

We design and build ELT pipelines that grow with your business — from selecting the right tools to implementing a production-grade data stack with monitoring, testing, and documentation baked in from day one.
