Data Engineering

Modern Data Engineering Stack: Build Scalable Data Pipelines

Cesar Adames

Learn how to architect production-grade data pipelines using modern tools like dbt, Airflow, and Snowflake. Best practices for ETL, data quality, and real-time processing.

#data-engineering #etl #data-pipeline #dbt #airflow #snowflake


The modern data stack has transformed how organizations build data infrastructure. Cloud-native tools, modular architectures, and SQL-first approaches let small teams handle petabyte-scale workloads that once required large engineering organizations.

Architecture Components

Data Warehouse

Cloud data warehouses provide compute and storage separation:

Snowflake:

  • Virtual warehouses for workload isolation
  • Automatic scaling and concurrency
  • Time travel and zero-copy cloning (see the SQL sketch after this list)
  • Secure data sharing across organizations
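
These capabilities surface as plain SQL. A minimal sketch follows, with the warehouse, database, and table names assumed for illustration:

-- Isolated warehouse that pauses itself when idle
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

-- Time travel: query a table as it looked one hour ago
SELECT count(*) FROM analytics.reporting.orders AT (OFFSET => -3600);

-- Zero-copy clone: instant copy that shares storage until rows change
CREATE DATABASE analytics_dev CLONE analytics;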

BigQuery:

  • Serverless, fully-managed
  • Pay-per-query pricing
  • Native machine learning (BQML), as sketched below
  • Real-time analytics on streaming data
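
As a taste of BQML, the sketch below trains and scores a model with nothing but SQL; the dataset, tables, and label column are assumptions:

-- Train a churn classifier directly in BigQuery (BQML)
CREATE OR REPLACE MODEL `analytics.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM `analytics.customer_features`;

-- Score new rows with the trained model
SELECT *
FROM ML.PREDICT(MODEL `analytics.churn_model`,
                (SELECT * FROM `analytics.new_customers`));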

Databricks:

  • Unified data lakehouse architecture
  • Delta Lake for ACID transactions
  • ML workflows and model deployment
  • Photon engine for query acceleration

Orchestration

Airflow remains the de facto standard for workflow management:

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),  # required for Airflow to schedule runs
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True
}

with DAG(
    'revenue_pipeline',
    default_args=default_args,
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    catchup=False
) as dag:

    extract_data = SnowflakeOperator(
        task_id='extract_sales_data',
        sql='sql/extract_sales.sql',
        snowflake_conn_id='snowflake_prod'
    )

    transform_data = SnowflakeOperator(
        task_id='transform_revenue_metrics',
        sql='sql/transform_revenue.sql',
        snowflake_conn_id='snowflake_prod'
    )

    data_quality_check = SnowflakeOperator(
        task_id='validate_revenue_totals',
        sql='sql/quality_checks.sql',
        snowflake_conn_id='snowflake_prod'
    )

    extract_data >> transform_data >> data_quality_check

Alternatives:

  • Prefect: Modern Python workflows with type safety
  • Dagster: Software-defined assets and testing
  • Mage: GUI-based pipeline development

Transformation - dbt

Analytics engineers use SQL to transform raw data:

-- models/staging/stg_orders.sql
with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        order_id,
        customer_id,
        order_date,
        order_total,
        status,
        _fivetran_synced as loaded_at
    from source
)

select * from renamed

dbt Features:

  • Version control for analytics code (Git)
  • Testing and documentation
  • Incremental models for efficiency (see the sketch after this list)
  • Dependency management (DAG)
  • CI/CD integration
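
To make the incremental-models bullet concrete, here is a minimal dbt sketch, assuming a stg_events staging model with event_id and event_timestamp columns:

-- models/marts/fct_events.sql (incremental: only new rows are processed each run)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}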

Data Integration

Move data from sources to warehouse:

Fivetran:

  • 200+ pre-built connectors
  • Automated schema detection
  • Change data capture (CDC)
  • Managed infrastructure

Airbyte (Open Source):

  • Custom connector development
  • Self-hosted or cloud
  • Incremental sync strategies
  • Transformation via dbt

Stitch (Talend):

  • Simple setup and configuration
  • Enterprise support
  • Replication monitoring

Real-Time Data Pipelines

Streaming Architecture

Process events as they occur:

Kafka:

  • Distributed event streaming platform
  • High throughput, low latency
  • Horizontal scalability
  • Message replay capability

Example Consumer (Python):

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'user_events',
    bootstrap_servers=['kafka:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='event_processor'
)

# process_event and write_to_destination are application-specific handlers
for message in consumer:
    event = message.value
    # Process event (enrich, validate, transform)
    process_event(event)
    # Write to destination (warehouse, cache, database)
    write_to_destination(event)

Stream Processing:

  • Flink: Complex event processing, stateful computations
  • Spark Streaming: Micro-batch processing
  • Kafka Streams: Lightweight stream processing

Change Data Capture

Track database changes in real-time:

Debezium:

  • Capture row-level changes from databases
  • Support for MySQL, PostgreSQL, MongoDB, SQL Server
  • Kafka-based distribution
  • At-least-once delivery guarantees (exactly-once requires additional configuration)
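
Once captured changes land in a raw changelog table in the warehouse, they are typically folded into the target table with a MERGE. A minimal Snowflake sketch, with table and column names assumed and Debezium's op field ('d' for delete) used to route actions:

-- Apply the latest captured change per key to the target table
MERGE INTO analytics.customers AS t
USING (
    SELECT *
    FROM raw.customers_changelog
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.email = s.email, t.region = s.region
WHEN NOT MATCHED AND s.op <> 'd' THEN
    INSERT (customer_id, email, region) VALUES (s.customer_id, s.email, s.region);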

Data Quality Framework

Testing Strategy

Implement comprehensive data quality checks:

dbt Tests:

# models/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Order fact table with revenue metrics"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true

Great Expectations:

  • Profiling and validation
  • Custom expectations
  • Data documentation
  • Integration with pipelines

Monitoring and Alerts

Track pipeline health and data freshness:

Monte Carlo or Datafold:

  • Anomaly detection on metrics
  • Schema change monitoring
  • Data lineage tracking
  • Incident management

Custom Monitoring:

-- Freshness check (Snowflake): flag tables not updated in the last 6 hours
SELECT
    table_name,
    last_altered AS last_update,
    DATEDIFF(hour, last_altered, CURRENT_TIMESTAMP()) AS hours_stale
FROM information_schema.tables
WHERE table_schema = 'ANALYTICS'
  AND DATEDIFF(hour, last_altered, CURRENT_TIMESTAMP()) > 6;

Cost Optimization

Warehouse Performance

Optimize compute costs:

Query Optimization:

  • Minimize data scanned (column pruning, partition filtering)
  • Appropriate clustering keys
  • Materialized views for repeated queries (see the sketch after this list)
  • Result caching
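
Two of the levers above, clustering keys and materialized views, look like this in Snowflake SQL; the table and column names are assumptions:

-- Cluster a large fact table on the column most queries filter by
ALTER TABLE analytics.reporting.fct_orders CLUSTER BY (order_date);

-- Materialize a frequently repeated aggregate so it is not recomputed per query
CREATE MATERIALIZED VIEW analytics.reporting.daily_revenue AS
SELECT order_date, sum(order_total) AS revenue
FROM analytics.reporting.fct_orders
GROUP BY order_date;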

Resource Management:

  • Auto-suspend idle warehouses
  • Warehouse sizing based on workload
  • Separate warehouses for different teams/workloads
  • Monitor query performance and cost, e.g. with a resource monitor as sketched below
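
A resource monitor is one way to enforce the sizing and monitoring bullets above; the quota and names below are illustrative:

-- Cap monthly credit spend and suspend the warehouse if the cap is hit
CREATE RESOURCE MONITOR monthly_cap WITH CREDIT_QUOTA = 100
    TRIGGERS ON 90 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_cap;

-- Auto-suspend an idle warehouse after 60 seconds
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 60;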

Storage Efficiency

Reduce storage costs:

Data Lifecycle:

  • Archive old data to cheaper storage (S3 Glacier)
  • Implement retention policies (see the sketch after this list)
  • Compress historical data
  • Delete unused tables/views
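
A minimal sketch of simple retention controls in Snowflake, with table names and windows assumed:

-- Shorten time-travel retention on a bulky raw table to cut storage
ALTER TABLE raw.events SET DATA_RETENTION_TIME_IN_DAYS = 1;

-- Enforce a two-year retention policy on historical rows
DELETE FROM raw.events
WHERE event_date < DATEADD(year, -2, CURRENT_DATE());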

Deduplication:

-- Deduplicate by keeping the most recent row per order_id
-- (DELETE ... WHERE order_id IN (... rn > 1) would drop every copy, since duplicates share the key)
CREATE OR REPLACE TABLE orders AS
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at DESC) = 1;

Security and Governance

Access Control

Implement least-privilege access:

Role-Based Access (Snowflake):

-- Create roles
CREATE ROLE analyst;
CREATE ROLE data_engineer;

-- Grant schema access
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.reporting TO ROLE analyst;

-- Grant table access
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.reporting TO ROLE analyst;

-- Assign roles to users
GRANT ROLE analyst TO USER jane_doe;

Row-Level Security:

-- Create security policy (assumes a region_access_map table mapping roles to regions)
CREATE ROW ACCESS POLICY customer_region_policy AS (region_col STRING) RETURNS BOOLEAN ->
    IS_ROLE_IN_SESSION('ADMIN')
    OR EXISTS (
        SELECT 1
        FROM security.region_access_map m
        WHERE m.role_name = CURRENT_ROLE()
          AND m.region = region_col
    );

-- Apply policy to table
ALTER TABLE customers ADD ROW ACCESS POLICY customer_region_policy ON (region);

Data Lineage

Track data provenance:

  • dbt docs for transformation lineage
  • Marquez for end-to-end lineage
  • OpenLineage standard
  • Impact analysis for changes

Deployment Best Practices

Environment Strategy

Maintain separate environments:

Development:

  • Individual dev schemas per engineer
  • Clone production data (anonymized), as sketched below
  • Fast iteration and testing
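
Zero-copy cloning keeps per-engineer dev schemas cheap; the schema names below are illustrative:

-- Give each engineer an isolated dev schema cloned from production
CREATE SCHEMA analytics.dev_jane CLONE analytics.reporting;

-- Tear it down when the work is merged
DROP SCHEMA IF EXISTS analytics.dev_jane;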

Staging:

  • Pre-production validation
  • Integration testing
  • Performance testing with production-like data

Production:

  • Automated deployments via CI/CD
  • Rollback capability
  • Monitoring and alerting

CI/CD Pipeline

Automate testing and deployment:

# .github/workflows/deploy.yml
name: Deploy Data Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9

      - name: Install dbt
        run: pip install dbt-snowflake

      - name: Install dbt packages
        run: dbt deps

      - name: Run dbt tests
        run: dbt test --profiles-dir .

      - name: Check source freshness
        run: dbt source freshness --profiles-dir .

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dbt
        run: pip install dbt-snowflake

      - name: Deploy to production
        run: dbt run --target prod --profiles-dir .

Implementation Roadmap

Weeks 1-2: Foundation

  • Select cloud data warehouse
  • Set up data integration tool (Fivetran/Airbyte)
  • Configure source connections
  • Ingest initial data

Weeks 3-4: Transformation

  • Implement dbt project structure
  • Build staging models
  • Create core business logic transformations
  • Write initial tests

Weeks 5-6: Orchestration

  • Deploy Airflow (or alternative)
  • Create DAGs for pipeline orchestration
  • Set up monitoring and alerts
  • Implement data quality checks

Weeks 7-8: Optimization

  • Performance tuning
  • Cost optimization
  • Documentation
  • Team training

Team Structure

  • Data Engineer: Infrastructure, orchestration, integrations
  • Analytics Engineer: dbt transformations, data modeling
  • Data Analyst: Business logic, reporting, stakeholder engagement

Modern data stacks enable lean teams to deliver enterprise-grade data infrastructure. Partner with experts to accelerate implementation and avoid common pitfalls.
