MLOps Pipelines: Production-Grade ML Infrastructure
Build robust MLOps pipelines with CI/CD, automated testing, monitoring, and deployment strategies for reliable machine learning systems at scale.
The MLOps Imperative
Machine learning models that perform excellently in notebooks often fail in production due to data drift, infrastructure issues, and lack of reproducibility. MLOps, the practice of applying DevOps principles to machine learning systems, bridges the gap between experimental models and reliable production systems.
Organizations with mature MLOps practices routinely report order-of-magnitude faster deployments, far fewer production failures, and more consistent business value from their AI investments.
Core MLOps Components
1. Version Control
What to Version
- Code: Training scripts, preprocessing, inference logic
- Data: Training/validation datasets, feature engineering code
- Models: Serialized models, architecture definitions
- Configurations: Hyperparameters, infrastructure settings
- Experiments: Metrics, parameters, artifacts
Tools & Best Practices
Code Versioning (Git)
ml-project/
├── src/
│   ├── data/
│   │   ├── ingestion.py
│   │   └── preprocessing.py
│   ├── features/
│   │   └── engineering.py
│   ├── models/
│   │   ├── train.py
│   │   └── predict.py
│   └── evaluation/
│       └── metrics.py
├── configs/
│   ├── training_config.yaml
│   └── deployment_config.yaml
├── tests/
│   ├── test_data.py
│   ├── test_features.py
│   └── test_models.py
└── requirements.txt
Data Versioning (DVC)
# Initialize DVC
dvc init
# Track dataset
dvc add data/training_data.csv
# Commit DVC file (not actual data)
git add data/training_data.csv.dvc .gitignore
git commit -m "Add training dataset v1.0"
# Push data to remote storage (S3, GCS, Azure)
dvc push
# Reproduce with specific data version
git checkout v1.0
dvc pull
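Versioned data can also be consumed programmatically. As a minimal sketch (assuming the tracked file and the v1.0 tag from the commands above, with a DVC remote already configured), the Python API can stream a specific revision directly from remote storage:

import dvc.api

# Open the dataset exactly as it existed at the v1.0 tag
with dvc.api.open("data/training_data.csv", rev="v1.0") as f:
    header = f.readline()  # e.g. inspect the schema of that revision

This keeps training jobs pinned to an explicit data revision instead of "whatever is currently on disk."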
Model Versioning (MLflow)
import mlflow

# Log model with versioning
with mlflow.start_run():
    # Train model
    model = train_model(X_train, y_train)

    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)

    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")

    # Tag production-ready models
    mlflow.set_tag("stage", "production")
2. Automated Training Pipelines
Pipeline Orchestration
Apache Airflow Example
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'ml-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'ml_training_pipeline',
    default_args=default_args,
    description='End-to-end ML training pipeline',
    schedule_interval='@weekly',  # Retrain weekly
    start_date=datetime(2025, 1, 1),
    catchup=False,
)

# Data extraction
extract_data = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data_from_warehouse,
    dag=dag,
)

# Data validation
validate_data = PythonOperator(
    task_id='validate_data',
    python_callable=validate_data_quality,
    dag=dag,
)

# Feature engineering
engineer_features = PythonOperator(
    task_id='engineer_features',
    python_callable=create_features,
    dag=dag,
)

# Model training
train_model = PythonOperator(
    task_id='train_model',
    python_callable=train_and_log_model,
    dag=dag,
)

# Model evaluation
evaluate_model = PythonOperator(
    task_id='evaluate_model',
    python_callable=evaluate_on_holdout,
    dag=dag,
)

# Conditional deployment
deploy_model = PythonOperator(
    task_id='deploy_model',
    python_callable=deploy_if_improved,
    dag=dag,
)

# Define dependencies
extract_data >> validate_data >> engineer_features >> train_model >> evaluate_model >> deploy_model
Kubeflow Pipelines
- Kubernetes-native ML workflows
- Containerized pipeline components
- Scalable, cloud-agnostic
- Built-in experiment tracking
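For flavor, here is a minimal Kubeflow pipeline sketch assuming the kfp v2 SDK; the component body and base image are placeholders rather than a real training step:

from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def train_model(epochs: int) -> str:
    # Placeholder for real training logic; would normally return a model URI
    return f"trained-for-{epochs}-epochs"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(epochs: int = 10):
    train_model(epochs=epochs)

# Compile to a YAML spec that can be submitted to a Kubeflow cluster
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")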
Prefect / Metaflow
- Python-native pipeline frameworks
- Dynamic workflow generation
- Easy local development
- Production-grade scheduling
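As an illustration of the Python-native style, a minimal Prefect flow (assuming Prefect 2+; the task bodies are stand-ins, not real pipeline steps) is just decorated functions:

from prefect import flow, task

@task(retries=2)
def extract_data() -> list[float]:
    return [0.2, 0.4, 0.6]  # stand-in for a real extraction step

@task
def train_model(data: list[float]) -> float:
    return sum(data) / len(data)  # stand-in for real training

@flow(name="weekly-training")
def training_flow() -> float:
    data = extract_data()
    return train_model(data)

if __name__ == "__main__":
    training_flow()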
3. Continuous Integration for ML
Model Testing Strategy
Unit Tests
import pytest
import numpy as np
from src.models.train import RandomForestModel
def test_model_training():
    """Test model can train on sample data"""
    X = np.random.rand(100, 10)
    y = np.random.randint(0, 2, 100)

    model = RandomForestModel()
    model.fit(X, y)

    assert model.is_fitted()
    assert model.n_features == 10


def test_model_predictions():
    """Test model predictions have correct shape and range"""
    X_train = np.random.rand(100, 10)
    y_train = np.random.randint(0, 2, 100)
    X_test = np.random.rand(20, 10)

    model = RandomForestModel()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    assert predictions.shape[0] == 20
    assert all(p in [0, 1] for p in predictions)
def test_model_serialization():
    """Test model can be saved and loaded"""
    X_train = np.random.rand(100, 10)
    y_train = np.random.randint(0, 2, 100)
    X_test = np.random.rand(20, 10)

    model = RandomForestModel()
    model.fit(X_train, y_train)

    # Save
    model.save('test_model.pkl')

    # Load
    loaded_model = RandomForestModel.load('test_model.pkl')

    # Verify equivalence
    np.testing.assert_array_equal(
        model.predict(X_test),
        loaded_model.predict(X_test)
    )
Data Tests
from datetime import datetime


def test_data_schema():
    """Validate training data schema"""
    df = load_training_data()

    # Check required columns
    required_cols = ['feature_1', 'feature_2', 'target']
    assert all(col in df.columns for col in required_cols)

    # Check data types
    assert df['feature_1'].dtype == 'float64'
    assert df['target'].dtype == 'int64'

    # Check value ranges
    assert df['feature_1'].between(0, 100).all()
    assert df['target'].isin([0, 1]).all()


def test_data_freshness():
    """Ensure data is recent"""
    df = load_training_data()
    max_date = df['timestamp'].max()
    days_old = (datetime.now() - max_date).days

    assert days_old < 7, f"Data is {days_old} days old"


def test_no_data_leakage():
    """Verify no future data in training set"""
    df = load_training_data()

    # Check timestamps: every feature must be computed before its target event
    assert (df['feature_created_at'] <= df['target_event_at']).all()
Model Performance Tests
def test_model_accuracy_threshold():
    """Model meets minimum accuracy requirement"""
    model = load_latest_model()
    X_test, y_test = load_holdout_data()

    accuracy = model.score(X_test, y_test)
    assert accuracy >= 0.85, f"Model accuracy {accuracy} below threshold"


def test_model_bias():
    """Check for demographic bias"""
    model = load_latest_model()
    test_data = load_test_data_with_demographics()

    for group in test_data['demographic_group'].unique():
        group_data = test_data[test_data['demographic_group'] == group]
        group_accuracy = calculate_accuracy(model, group_data)

        # Ensure no group has significantly lower accuracy
        assert group_accuracy >= 0.80, f"{group} accuracy too low: {group_accuracy}"
4. Continuous Deployment (CD)
Deployment Strategies
Blue-Green Deployment
def blue_green_deploy(new_model_version):
    """
    Deploy new model while keeping old version (rollback capability)
    """
    # Deploy new model to "green" environment
    green_endpoint = deploy_to_green(new_model_version)

    # Run smoke tests
    if not smoke_test(green_endpoint):
        raise DeploymentError("Smoke tests failed")

    # Gradually shift traffic (0% -> 10% -> 50% -> 100%)
    for traffic_pct in [10, 50, 100]:
        shift_traffic(green_endpoint, traffic_pct)
        sleep(300)  # Monitor for 5 minutes

        # Check error rates
        if get_error_rate(green_endpoint) > threshold:
            rollback_traffic()
            raise DeploymentError("Error rate too high")

    # Mark green as primary, keep blue for rollback
    promote_to_production(green_endpoint)
Canary Deployment
def canary_deploy(new_model_version):
    """
    Deploy to small subset of traffic, gradually increase
    """
    canary_endpoint = deploy_canary(new_model_version)

    # Start with 5% traffic
    route_traffic(canary_endpoint, percentage=5)

    # Monitor key metrics
    for hour in range(24):
        metrics = get_canary_metrics(canary_endpoint)

        if metrics['error_rate'] > baseline_error_rate * 1.2:
            rollback_canary()
            raise DeploymentError("Canary error rate elevated")

        if metrics['latency_p95'] > sla_latency:
            rollback_canary()
            raise DeploymentError("Canary latency too high")

        sleep(3600)  # Wait 1 hour

    # All good, promote to 100%
    promote_canary_to_production()
Shadow Mode
def shadow_deploy(new_model_version):
    """
    Deploy new model but don't use predictions (log only)
    """
    shadow_endpoint = deploy_shadow(new_model_version)

    # Route requests to both models
    configure_dual_prediction(
        primary=production_model,
        shadow=shadow_endpoint,
        use_shadow_predictions=False  # Log only
    )

    # Compare predictions for 1 week
    for day in range(7):
        comparison = compare_predictions(production_model, shadow_endpoint)
        log_comparison_metrics(comparison)

        # Check agreement rate
        if comparison['agreement_rate'] < 0.95:
            alert_team(f"Models disagree {1 - comparison['agreement_rate']:.1%} of time")

    # Manual review before promoting
    if manual_approval():
        promote_to_production(shadow_endpoint)
5. Monitoring & Observability
Model Performance Monitoring
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
prediction_counter = Counter('predictions_total', 'Total predictions made')
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')
model_accuracy = Gauge('model_accuracy', 'Current model accuracy')
prediction_confidence = Histogram('prediction_confidence', 'Prediction confidence scores')


def predict_with_monitoring(model, features):
    """Prediction with comprehensive monitoring"""
    with prediction_latency.time():
        # Make prediction
        prediction = model.predict(features)
        confidence = model.predict_proba(features).max()

    # Record metrics
    prediction_counter.inc()
    prediction_confidence.observe(confidence)

    # Log prediction (for later analysis)
    log_prediction(
        timestamp=datetime.now(),
        features=features,
        prediction=prediction,
        confidence=confidence,
        model_version=model.version
    )

    return prediction
Data Drift Detection
from scipy.stats import ks_2samp
def detect_feature_drift(reference_data, current_data, threshold=0.05):
    """
    Detect distribution drift using Kolmogorov-Smirnov test
    """
    drift_detected = {}

    for feature in reference_data.columns:
        # Compare distributions
        statistic, p_value = ks_2samp(
            reference_data[feature],
            current_data[feature]
        )

        # Flag if significantly different
        if p_value < threshold:
            drift_detected[feature] = {
                'statistic': statistic,
                'p_value': p_value,
                'severity': 'high' if p_value < 0.01 else 'medium'
            }

    if drift_detected:
        alert_team(f"Drift detected in features: {list(drift_detected.keys())}")
        trigger_model_retraining()

    return drift_detected
Concept Drift Monitoring
def monitor_concept_drift(window_size=1000):
    """
    Monitor if relationship between features and target is changing
    """
    # Get recent predictions and actual labels
    recent_data = get_recent_predictions_with_labels(window_size)

    # Calculate accuracy on recent data
    recent_accuracy = calculate_accuracy(recent_data)

    # Compare to training accuracy
    if recent_accuracy < training_accuracy * 0.90:  # 10% degradation
        alert_team(f"Model accuracy degraded: {recent_accuracy:.2%} vs {training_accuracy:.2%}")
        trigger_model_retraining()

    # Track over time
    log_metric('production_accuracy', recent_accuracy)
MLOps Tools & Platforms
Experiment Tracking
MLflow
- Open-source, popular
- Tracks experiments, models, artifacts
- Model registry
- Deployment integrations
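To make the model-registry bullet concrete, here is a hedged sketch of registering and promoting a model version. It assumes a tracking server with a database-backed registry; the registry name "churn_classifier" is illustrative and the run ID is a placeholder for the run from the earlier logging example.

import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run-id-from-earlier-training-run>"  # placeholder
model_uri = f"runs:/{run_id}/random_forest_model"

# Create a new version under a named registry entry
result = mlflow.register_model(model_uri=model_uri, name="churn_classifier")

# Promote that version so deployment tooling can pick it up
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier",
    version=result.version,
    stage="Production",
)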
Weights & Biases
- Beautiful visualizations
- Team collaboration
- Hyperparameter tuning
- Model management
Neptune.ai
- Metadata management
- Experiment comparison
- Model registry
- Integrates with major frameworks
Pipeline Orchestration
- Airflow: Most mature, handles complex workflows
- Kubeflow: Kubernetes-native, ML-focused
- Prefect: Modern, Python-first approach
- Metaflow: Netflix's framework, great UX
Model Serving
- TensorFlow Serving: TensorFlow models, high performance (see the request sketch below)
- TorchServe: PyTorch models, production-ready
- Seldon Core: Kubernetes, language-agnostic
- BentoML: Python-friendly, easy deployment
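As a small illustration of how these servers are consumed, here is a sketch of calling TensorFlow Serving's default REST API; the host, model name "churn_model", and feature vector are assumptions.

import requests

# TensorFlow Serving exposes REST on port 8501 by default:
# POST /v1/models/<model_name>:predict
url = "http://localhost:8501/v1/models/churn_model:predict"
payload = {"instances": [[0.4, 1.2, 3.5, 0.0]]}

response = requests.post(url, json=payload, timeout=5)
response.raise_for_status()
predictions = response.json()["predictions"]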
Feature Stores
- Feast: Open-source, flexible (see the lookup sketch below)
- Tecton: Enterprise, managed
- AWS SageMaker Feature Store: AWS-integrated
- Databricks Feature Store: Databricks-native
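For instance, with Feast the serving path reads precomputed features at low latency. A minimal sketch, assuming a Feast repo with a hypothetical "user_features" feature view keyed by "user_id":

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # path to the feature repo

online_features = store.get_online_features(
    features=[
        "user_features:avg_purchase_amount",
        "user_features:days_since_last_order",
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()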
Best Practices
1. Reproducibility
Make Everything Reproducible
- Pin dependency versions (requirements.txt with exact versions)
- Set random seeds in code (see the sketch after this list)
- Version data alongside code
- Document infrastructure requirements
- Use containerization (Docker)
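A small seeding helper, referenced in the list above, is enough to pin the usual sources of randomness; extend it for whichever ML framework you use.

import os
import random

import numpy as np


def set_global_seeds(seed: int = 42) -> None:
    """Pin common sources of randomness for reproducible training runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # Framework-specific seeding would go here as well,
    # e.g. torch.manual_seed(seed) or tf.random.set_seed(seed).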
2. Automated Testing
Test Coverage Goals
- Unit tests: 80%+ code coverage
- Integration tests: Critical paths
- Model performance tests: Every training run
- Data quality tests: Every data refresh
3. Documentation
Essential Documentation
- Model cards (purpose, training data, limitations)
- API documentation (input/output schemas)
- Deployment guides
- Monitoring runbooks
- Incident response procedures
4. Security
ML-Specific Security
- Secure model artifacts (prevent theft)
- Input validation (prevent adversarial attacks)
- Output filtering (prevent sensitive data leakage)
- Access controls (who can deploy models)
- Audit logging (track model usage)
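To make the input-validation point above concrete, a minimal sketch using pydantic; the field names and bounds are illustrative, mirroring the schema tests earlier.

from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    """Inference request schema; names and ranges are illustrative."""
    feature_1: float = Field(ge=0, le=100)
    feature_2: float


def parse_request(payload: dict) -> PredictionRequest:
    try:
        return PredictionRequest(**payload)
    except ValidationError as exc:
        # Reject malformed or out-of-range inputs before they reach the model
        raise ValueError(f"Invalid prediction request: {exc}") from exc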
Maturity Model
Level 0: Manual Process
- Notebooks on data scientist laptops
- Manual deployment
- No monitoring
- Time to production: Months
Level 1: Automated Training
- Automated training pipelines
- Version control for code
- Basic testing
- Time to production: Weeks
Level 2: Automated Deployment
- CI/CD for models
- Model registry
- A/B testing capability
- Time to production: Days
Level 3: Full MLOps
- Automated monitoring and retraining
- Feature stores
- Data quality checks
- Automated rollback
- Time to production: Hours
Conclusion
MLOps transforms machine learning from experimental projects to reliable production systems. By applying software engineering best practices (version control, automated testing, CI/CD, monitoring) to ML workflows, organizations can deploy models faster, reduce failures, and maintain high-quality AI systems at scale.
The key is starting with foundational practices (version control, basic pipelines) and progressively adding sophistication as your ML maturity grows.
Next Steps:
- Implement version control for code, data, and models
- Build automated training pipeline for one key model
- Add comprehensive testing (data, model, integration)
- Deploy with basic monitoring and alerting
- Iterate toward full MLOps maturity