
Kubernetes in Production: Complete Deployment Guide

Cesar Adames

Master production-grade Kubernetes deployments with best practices for high availability, security, monitoring, and cost optimization. Real-world patterns for enterprise workloads.

#kubernetes #devops #cloud-infrastructure #containers #orchestration

Kubernetes has become the de facto standard for orchestrating production container workloads, running everywhere from startups to the largest enterprises. However, moving from development clusters to production-ready infrastructure requires careful planning across networking, security, monitoring, and cost management.

Production Architecture Patterns

Multi-Region High Availability

Design for zero-downtime deployments and disaster recovery:

Control Plane:

  • Three or more control plane nodes spread across availability zones
  • Managed services (EKS, GKE, AKS) for control plane HA
  • etcd cluster with automated backups
  • Load balancing across API servers

Worker Nodes:

  • Node pools spanning multiple AZs
  • Auto-scaling groups with min/max limits
  • Mixed instance types for cost optimization
  • Spot instances for fault-tolerant workloads
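
On EKS, the node-pool layout above can be sketched with an eksctl config: one on-demand group for baseline capacity and one spot group for fault-tolerant workloads. Cluster name, sizes, and instance types here are illustrative, not prescriptive.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster          # assumed cluster name
  region: us-east-1
managedNodeGroups:
- name: baseline              # on-demand nodes for steady-state load
  instanceTypes: ["m6i.large"]
  minSize: 3
  maxSize: 6
  availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
- name: spot-workers          # cheaper capacity for interruptible workloads
  spot: true
  instanceTypes: ["m6i.large", "m5.large", "m5a.large"]  # mixed types improve spot availability
  minSize: 0
  maxSize: 20
```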

Network Architecture

Production-grade networking requires careful planning:

Ingress Strategy:

  • AWS ALB/NLB with Ingress Controllers
  • NGINX or Traefik for advanced routing
  • TLS termination at load balancer
  • Web Application Firewall (WAF) integration

Service Mesh (for complex microservices):

  • Istio or Linkerd for traffic management
  • mTLS for pod-to-pod encryption
  • Advanced routing (canary, blue-green)
  • Distributed tracing integration
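
With Istio, the pod-to-pod mTLS mentioned above can be enforced namespace-wide with a single PeerAuthentication policy (the namespace name is an assumption):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production   # assumed namespace
spec:
  mtls:
    mode: STRICT          # reject any plaintext pod-to-pod traffic
```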

Network Policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  # Because Egress is listed in policyTypes, all outbound traffic is
  # denied unless explicitly allowed; permit DNS so the pod can
  # still resolve names.
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Security Hardening

Identity and Access Management

Implement least-privilege access:

RBAC Configuration:

  • Separate namespaces per environment/team
  • Role-based access control for developers
  • Service accounts with minimal permissions
  • Regular access reviews and audits
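
A minimal least-privilege setup can look like the following sketch: a read-only Role scoped to one team namespace, bound to a developer group. The namespace and group names are hypothetical and would map to your identity provider.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-readonly
  namespace: team-api            # assumed team namespace
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "pods/log", "deployments", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-readonly-binding
  namespace: team-api
subjects:
- kind: Group
  name: team-api-developers      # assumed IdP group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer-readonly
  apiGroup: rbac.authorization.k8s.io
```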

Pod Security:

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.4.2  # pin an immutable tag; avoid :latest in production
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL

Secrets Management

Never store secrets in code or ConfigMaps:

External Secrets Operator:

  • Integrate with AWS Secrets Manager, Vault, Azure Key Vault
  • Automatic secret rotation
  • Encryption at rest and in transit

Sealed Secrets:

  • GitOps-friendly secret encryption
  • Decrypt only in-cluster
  • Public key encryption for developers
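
With the External Secrets Operator, the sync from an external store can be sketched like this — the SecretStore name and the Secrets Manager path are assumptions for illustration:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h            # re-sync hourly, picking up rotations
  secretStoreRef:
    name: aws-secrets-manager    # a pre-configured ClusterSecretStore (assumed name)
    kind: ClusterSecretStore
  target:
    name: db-credentials         # the Kubernetes Secret created in-cluster
  data:
  - secretKey: password
    remoteRef:
      key: prod/db               # assumed path in AWS Secrets Manager
      property: password
```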

Image Security

Scan and verify container images:

  • Automated vulnerability scanning (Trivy, Snyk)
  • Image signing and verification (Sigstore, Notary)
  • Private registries with access controls
  • Admission controllers blocking vulnerable images
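
As one example of an admission-control gate, a Kyverno policy can require that images from your registry carry a valid signature before pods are admitted. This is a sketch: the registry pattern and key are placeholders, and the exact schema should be checked against your Kyverno version.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce   # block, rather than just audit
  rules:
  - name: check-signature
    match:
      any:
      - resources:
          kinds:
          - Pod
    verifyImages:
    - imageReferences:
      - "registry.example.com/*"     # assumed private registry
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <your cosign public key>
              -----END PUBLIC KEY-----
```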

Resource Management

Requests and Limits

Critical for stability and cost control:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Best Practices:

  • Set requests based on actual usage (p95)
  • Limits prevent resource starvation
  • Use Vertical Pod Autoscaler for tuning
  • Monitor OOM kills and throttling
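
For the VPA-based tuning mentioned above, running the Vertical Pod Autoscaler in recommendation-only mode is a low-risk starting point — it reports suggested requests without evicting pods:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods
```

Inspect the recommendations with `kubectl describe vpa api-vpa` and fold them back into your manifests manually.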

Horizontal Pod Autoscaling

Scale based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Cluster Autoscaling

Automatically adjust node capacity:

  • Cluster Autoscaler for node scaling
  • Karpenter for intelligent provisioning (AWS)
  • Scale down grace periods
  • Pod disruption budgets
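
Pod disruption budgets keep autoscaler scale-downs (and node drains generally) from taking out too many replicas at once. A minimal example for the `api` workload:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # voluntary evictions must leave at least 2 pods running
  selector:
    matchLabels:
      app: api
```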

Deployment Strategies

Blue-Green Deployments

Zero-downtime releases with instant rollback:

  1. Deploy new version (green) alongside current (blue)
  2. Test green environment thoroughly
  3. Switch traffic from blue to green
  4. Keep blue running for quick rollback
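
One common way to implement the traffic switch in step 3 is a shared Service whose selector includes a version label; both Deployments stay running, and the selector decides which one receives traffic. Labels and ports here are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue        # flip to "green" to cut traffic over
  ports:
  - port: 80
    targetPort: 8080
```

The cutover is then a single patch, for example `kubectl patch service api -p '{"spec":{"selector":{"app":"api","version":"green"}}}'`, and rollback is the same patch in reverse.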

Canary Deployments

Gradual rollout with monitoring:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-canary
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  progressDeadlineSeconds: 600
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
    - name: request-duration
      thresholdRange:
        max: 500

Progressive Delivery

Combine canary with feature flags:

  • Flagger for automated canary analysis
  • LaunchDarkly/Flagsmith for feature toggles
  • Gradual user exposure
  • Automated rollback on anomalies

Monitoring and Observability

Metrics Stack

Prometheus + Grafana for comprehensive monitoring:

Infrastructure Metrics:

  • Node CPU, memory, disk utilization
  • Network I/O and latency
  • Pod resource usage
  • Persistent volume metrics

Application Metrics:

  • Request rates and latencies (RED method)
  • Error rates by endpoint
  • Database connection pools
  • Custom business metrics

Sample Dashboard Queries:

# Request rate
rate(http_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

Logging

Centralized log aggregation:

EFK Stack (Elasticsearch, Fluentd, Kibana):

  • Structured logging with JSON
  • Log retention policies
  • Index lifecycle management
  • Role-based access to logs

Alternatives: Loki (cost-effective), CloudWatch Logs, Datadog

Distributed Tracing

Debug complex microservice interactions:

  • Jaeger or Tempo for trace storage
  • OpenTelemetry for instrumentation
  • Automatic context propagation
  • End-to-end request visibility

Disaster Recovery

Backup Strategy

Protect against data loss and cluster failures:

Velero for Cluster Backups:

  • Scheduled backups of resources and volumes
  • Cross-region replication
  • Restore to different clusters
  • Disaster recovery testing
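
A scheduled Velero backup can be sketched as follows — the schedule name, target namespace, and retention are assumptions to adapt:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-prod
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 02:00 daily, standard cron syntax
  template:
    includedNamespaces:
    - production                 # assumed namespace to protect
    ttl: 720h                    # retain backups for 30 days
```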

Database Backups:

  • Automated snapshots with retention policies
  • Point-in-time recovery capability
  • Encrypted backups
  • Regular restore testing

Chaos Engineering

Proactively test resilience:

  • Chaos Mesh or Litmus for experiment orchestration
  • Pod deletion tests
  • Network latency injection
  • Resource exhaustion scenarios
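
A pod-deletion experiment with Chaos Mesh can be sketched like this — namespace and labels are assumptions, and you would scope experiments carefully before running them against production:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill
spec:
  action: pod-kill
  mode: one                      # kill one matching pod at random
  selector:
    namespaces:
    - production                 # assumed target namespace
    labelSelectors:
      app: api
```

If the HPA, PDBs, and readiness probes are configured correctly, the deployment should absorb the kill with no user-visible errors.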

Cost Optimization

Right-Sizing:

  • Vertical Pod Autoscaler recommendations
  • Remove resource over-provisioning
  • Consolidate under-utilized nodes

Spot Instances:

  • Use for stateless workloads
  • Karpenter for spot provisioning
  • Graceful handling of terminations

Reserved Capacity:

  • Savings Plans or Reserved Instances for baseline
  • Commit for 1-3 years on predictable workloads
  • Typically 30-50% cheaper than equivalent on-demand capacity

Monitoring Costs:

  • Kubecost for cost allocation
  • Namespace-level budgets
  • Chargeback to teams
  • Idle resource detection

Production Checklist

Before Launch:

  • Multi-AZ control plane and workers
  • RBAC configured with least privilege
  • Network policies enforcing pod communication rules
  • Secrets managed externally (Vault, AWS Secrets Manager)
  • Resource requests/limits on all pods
  • HPA configured for variable load
  • Prometheus + Grafana monitoring
  • Centralized logging with retention policies
  • Automated backups with tested restore procedures
  • Ingress with TLS termination
  • Deployment pipelines with automated testing
  • Runbooks for common incident scenarios
  • Cost monitoring and budgets

Getting Started

Week 1-2: Infrastructure Setup

  • Provision managed Kubernetes (EKS, GKE, AKS)
  • Configure networking (VPC, subnets, security groups)
  • Set up node groups with autoscaling

Week 3-4: Security and Observability

  • Implement RBAC and network policies
  • Deploy monitoring stack (Prometheus, Grafana)
  • Configure centralized logging

Week 5-6: Application Deployment

  • Deploy applications with proper resources
  • Set up HPA and PDBs
  • Configure ingress and TLS

Week 7-8: Hardening and Testing

  • Backup and disaster recovery testing
  • Load testing and capacity planning
  • Security scanning and remediation
  • Documentation and runbooks

Kubernetes in production requires expertise across multiple domains. Partner with experienced teams to accelerate time-to-production while avoiding costly mistakes.
