Kubernetes in Production: Complete Deployment Guide
Master production-grade Kubernetes deployments with best practices for high availability, security, monitoring, and cost optimization. Real-world patterns for enterprise workloads.
Kubernetes is widely used to run production workloads in large enterprises. Moving from a development cluster to production-ready infrastructure, however, requires careful planning across networking, security, monitoring, and cost management.
Production Architecture Patterns
Multi-Region High Availability
Design for zero-downtime deployments and disaster recovery:
Control Plane:
- 3 or more control plane nodes spread across availability zones (an odd count, so etcd keeps quorum)
- Or a managed service (EKS, GKE, AKS) that runs a highly available control plane for you
- etcd cluster with automated backups
- Load balancing across API servers
Worker Nodes:
- Node pools spanning multiple AZs
- Auto-scaling groups with min/max limits
- Mixed instance types for cost optimization
- Spot instances for fault-tolerant workloads
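Node pools that span zones only help if the pods are spread across those zones too. A minimal sketch, assuming a hypothetical Deployment named api with the label app: api, a placeholder image, and nodes that carry the standard topology.kubernetes.io/zone label:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Keep the per-zone replica count within 1 of every other zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but never block scheduling
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # placeholder image
```

Switching whenUnsatisfiable to DoNotSchedule makes the spread a hard requirement, at the cost of pending pods when a zone has no capacity.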
Network Architecture
Production-grade networking requires careful planning:
Ingress Strategy:
- AWS ALB/NLB with Ingress Controllers
- NGINX or Traefik for advanced routing
- TLS termination at load balancer
- Web Application Firewall (WAF) integration
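As an illustration of the routing side, here is a sketch of an Ingress for an NGINX ingress controller. The host name, Secret name, and backend Service are assumptions. Note that with this manifest TLS terminates at the ingress controller; terminating at a cloud load balancer instead is configured through provider-specific settings that are not shown here.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
spec:
  ingressClassName: nginx            # assumes an IngressClass named nginx exists
  tls:
    - hosts:
        - api.example.com            # placeholder host
      secretName: api-example-com-tls   # hypothetical TLS Secret holding the certificate
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api            # hypothetical backend Service
                port:
                  number: 8080
```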
Service Mesh (for complex microservices):
- Istio or Linkerd for traffic management
- mTLS for pod-to-pod encryption
- Advanced routing (canary, blue-green)
- Distributed tracing integration
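With Istio, pod-to-pod mTLS can be made mandatory per namespace. A minimal sketch, assuming Istio is installed with sidecars injected and that the namespace is called production (a hypothetical name); newer Istio releases may also serve this resource under a v1 API version:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production   # hypothetical namespace
spec:
  mtls:
    mode: STRICT          # reject plaintext traffic to workloads in this namespace
```

Starting with mode: PERMISSIVE and moving to STRICT once all clients have sidecars is a common way to avoid breaking traffic during rollout.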
Network Policies:
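apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
spec:
  # Applies to every pod labeled app=api in this namespace
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress   # no egress rules are defined, so all outbound traffic (DNS included) is denied
  ingress:
    # Only frontend pods may connect, and only on TCP 8080
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080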
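Because the policy above lists Egress under policyTypes without any egress rules, the api pods cannot make outbound connections, including DNS lookups. A companion policy along these lines could re-open DNS. It is a sketch that assumes the cluster DNS pods run in kube-system with the label k8s-app: kube-dns, which is common but not universal:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-dns-egress   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Egress
  egress:
    - to:
        # Both selectors in one entry: DNS pods AND in kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns   # assumed label; check your distribution
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Network policies are additive, so further egress rules for databases or downstream services can live in separate policies that select the same pods.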
Security Hardening
Identity and Access Management
Implement least-privilege access:
RBAC Configuration:
- Separate namespaces per environment/team
- Role-based access control for developers
- Service accounts with minimal permissions
- Regular access reviews and audits
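A least-privilege setup for developers might look like the following sketch: a namespaced, read-only Role bound to a group. The namespace team-a and the group team-a-developers are hypothetical names; the group must match whatever your identity provider supplies.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-viewer
  namespace: team-a
rules:
  # Read-only access to core workload resources
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-viewer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers          # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-viewer
  apiGroup: rbac.authorization.k8s.io
```

Access can be spot-checked with kubectl auth can-i, impersonating a user or group.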
Pod Security:
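apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  # Pod-level defaults: run as a non-root user with the runtime's default seccomp profile
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:latest   # pin a version tag or digest in production instead of :latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true   # mount an emptyDir for any path the app must write to
        capabilities:
          drop:
            - ALL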
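Per-pod settings like these are easier to keep consistent when the namespace enforces them. A sketch using the built-in Pod Security Admission labels, assuming a hypothetical namespace named production:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production   # hypothetical namespace
  labels:
    # Reject pods that violate the "restricted" Pod Security Standard
    pod-security.kubernetes.io/enforce: restricted
    # Also surface violations as client warnings and audit log entries
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Applying only the warn and audit labels first is a low-risk way to find non-compliant workloads before switching on enforce.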
Secrets Management
Never store secrets in code or ConfigMaps:
External Secrets Operator:
- Integrate with AWS Secrets Manager, Vault, Azure Key Vault
- Automatic secret rotation
- Encryption at rest and in transit
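For illustration, an ExternalSecret that syncs one value from an external store into a Kubernetes Secret could look like this. The store name, remote key path, and property are hypothetical, a matching SecretStore is assumed to exist already, and the API version served depends on the operator release you run:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
spec:
  refreshInterval: 1h               # re-read the external value hourly, picking up rotations
  secretStoreRef:
    name: aws-secrets-manager       # hypothetical SecretStore defined elsewhere
    kind: SecretStore
  target:
    name: api-db-credentials        # Kubernetes Secret the operator creates and updates
  data:
    - secretKey: password           # key inside the Kubernetes Secret
      remoteRef:
        key: prod/api/db            # hypothetical path in the external store
        property: password
```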
Sealed Secrets:
- GitOps-friendly secret encryption
- Decrypt only in-cluster
- Public key encryption for developers
Image Security
Scan and verify container images:
- Automated vulnerability scanning (Trivy, Snyk)
- Image signing and verification (Sigstore, Notary)
- Private registries with access controls
- Admission controllers blocking vulnerable images
Resource Management
Requests and Limits
Critical for stability and cost control:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
Best Practices:
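- Set requests from observed usage, for example the p95 over a representative period
- Set limits so that one pod cannot starve its neighbors: exceeding the memory limit triggers an OOM kill, exceeding the CPU limit triggers throttling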
- Use Vertical Pod Autoscaler for tuning
- Monitor OOM kills and throttling
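To collect sizing recommendations without letting anything restart pods, the Vertical Pod Autoscaler can run in recommendation-only mode. A sketch, assuming the VPA components are installed in the cluster and the target is a Deployment named api:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # only compute recommendations; never evict or resize pods
```

The recommendations then appear in the object's status (for example via kubectl describe vpa api-vpa) and can be copied into the requests by hand.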
Horizontal Pod Autoscaling
Scale based on metrics:
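apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    # Scale on average CPU utilization, measured against the pods' CPU requests
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Scale on a per-pod custom metric; requires a custom metrics adapter
    # (for example prometheus-adapter) that exposes this metric name
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"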
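When several metrics are listed, the HPA uses whichever one asks for the most replicas. Scale-down can be damped to avoid flapping with a behavior stanza; the fragment below is a sketch that would sit under spec in the HorizontalPodAutoscaler, and the window and percentages are illustrative values, not recommendations:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # act on the highest recommendation from the last 5 minutes
    policies:
      - type: Percent
        value: 50                     # remove at most 50% of current replicas...
        periodSeconds: 60             # ...per 60-second period
  scaleUp:
    stabilizationWindowSeconds: 0     # react to load increases immediately
```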
Cluster Autoscaling
Automatically adjust node capacity:
- Cluster Autoscaler for node scaling
- Karpenter for intelligent provisioning (AWS)
- Scale down grace periods
- Pod disruption budgets
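Node scale-down works by draining nodes, so a PodDisruptionBudget is what keeps enough replicas serving while that happens. A minimal sketch for pods labeled app: api (the label and the threshold are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # voluntary evictions must leave at least 2 pods running
  selector:
    matchLabels:
      app: api
```

Keep minAvailable below the workload's minimum replica count; if the two are equal, drains can never evict a pod and node scale-down stalls.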
Deployment Strategies
Blue-Green Deployments
Zero-downtime releases with instant rollback:
- Deploy new version (green) alongside current (blue)
- Test green environment thoroughly
- Switch traffic from blue to green
- Keep blue running for quick rollback
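One simple way to implement the traffic switch is a Service whose selector includes a version label. This sketch assumes both Deployments label their pods app: api plus version: blue or version: green; the label names are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: green       # change back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Editing the selector moves new connections to the other set of pods almost immediately, while already-established connections may linger on the old pods until they close.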
Canary Deployments
Gradual rollout with monitoring:
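apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-canary
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  progressDeadlineSeconds: 600
  service:
    port: 8080
  analysis:
    interval: 1m        # how often the analysis runs
    threshold: 5        # failed checks tolerated before rollback
    maxWeight: 50       # highest share of traffic sent to the canary (percent)
    stepWeight: 10      # traffic increase per successful interval (percent)
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99       # percent of non-5xx responses
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500      # milliseconds
        interval: 1m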
Progressive Delivery
Combine canary with feature flags:
- Flagger for automated canary analysis
- LaunchDarkly/Flagsmith for feature toggles
- Gradual user exposure
- Automated rollback on anomalies
Monitoring and Observability
Metrics Stack
Prometheus + Grafana for comprehensive monitoring:
Infrastructure Metrics:
- Node CPU, memory, disk utilization
- Network I/O and latency
- Pod resource usage
- Persistent volume metrics
Application Metrics:
- Request rates and latencies (RED method)
- Error rates by endpoint
- Database connection pools
- Custom business metrics
Sample Dashboard Queries:
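# Request rate (requests per second, summed across all series)
sum(rate(http_requests_total[5m]))
# P95 latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate (fraction of requests returning 5xx; summing both sides avoids dividing each 5xx series by itself)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))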
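The same error-rate ratio can drive an alert. Below is a sketch of a Prometheus rule file; the 5% threshold, the 10-minute hold, and the severity label are illustrative assumptions, and the metric name follows the queries above:

```yaml
groups:
  - name: api-availability
    rules:
      - alert: HighErrorRate
        # Fraction of all requests that returned a 5xx status over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                      # must hold for 10 minutes before firing
        labels:
          severity: page              # hypothetical routing label
        annotations:
          summary: "More than 5% of requests are failing with 5xx responses"
```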
Logging
Centralized log aggregation:
EFK Stack (Elasticsearch, Fluentd, Kibana):
- Structured logging with JSON
- Log retention policies
- Index lifecycle management
- Role-based access to logs
Alternatives: Loki (cost-effective), CloudWatch Logs, Datadog
Distributed Tracing
Debug complex microservice interactions:
- Jaeger or Tempo for trace storage
- OpenTelemetry for instrumentation
- Automatic context propagation
- End-to-end request visibility
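A common arrangement is for applications to send OTLP data to an OpenTelemetry Collector, which batches and forwards it to the trace backend. A minimal collector configuration sketch follows; the exporter endpoint is a hypothetical in-cluster Tempo address, and the insecure TLS setting is only reasonable for trusted in-cluster traffic:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:              # group spans into batches before export

exporters:
  otlp:
    endpoint: tempo.observability.svc:4317   # hypothetical backend address
    tls:
      insecure: true                         # assumption: plaintext inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```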
Disaster Recovery
Backup Strategy
Protect against data loss and cluster failures:
Velero for Cluster Backups:
- Scheduled backups of resources and volumes
- Cross-region replication
- Restore to different clusters
- Disaster recovery testing
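A scheduled backup in Velero is declared as a Schedule resource. The sketch below assumes Velero is installed in the velero namespace with a backup storage location already configured; the namespace being backed up, the cron expression, and the retention period are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  template:
    includedNamespaces:
      - production             # hypothetical namespace
    snapshotVolumes: true      # include persistent volume snapshots
    ttl: 720h0m0s              # keep each backup for 30 days
```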
Database Backups:
- Automated snapshots with retention policies
- Point-in-time recovery capability
- Encrypted backups
- Regular restore testing
Chaos Engineering
Proactively test resilience:
- Chaos Mesh or Litmus for experiment orchestration
- Pod deletion tests
- Network latency injection
- Resource exhaustion scenarios
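As an example of a pod deletion test, a Chaos Mesh experiment that kills one randomly chosen api pod could be declared like this. It is a sketch that assumes Chaos Mesh is installed; the namespaces and the label are hypothetical, and staging is used deliberately so the first runs happen outside production:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill
  namespace: chaos-testing     # hypothetical namespace for experiment objects
spec:
  action: pod-kill
  mode: one                    # pick a single matching pod at random
  selector:
    namespaces:
      - staging                # hypothetical target namespace
    labelSelectors:
      app: api
```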
Cost Optimization
Right-Sizing:
- Vertical Pod Autoscaler recommendations
- Remove resource over-provisioning
- Consolidate under-utilized nodes
Spot Instances:
- Use for stateless workloads
- Karpenter for spot provisioning
- Graceful handling of terminations
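Graceful handling mostly comes down to giving pods time to drain when their node is reclaimed. The fragment below is a sketch of the relevant parts of a Deployment's pod template; the durations are illustrative, the image is a placeholder, and the preStop hook assumes the image contains a sleep binary:

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # total time allowed between SIGTERM and SIGKILL
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # placeholder image
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM briefly so load balancers stop routing to this pod first
                command: ["sleep", "15"]
```

The grace period should fit inside the provider's interruption notice, otherwise pods are cut off mid-shutdown.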
Reserved Capacity:
- Savings Plans or Reserved Instances for baseline
- Commit for 1-3 years on predictable workloads
- Roughly 30-50% lower cost than on-demand pricing, depending on provider, term, and payment option
Monitoring Costs:
- Kubecost for cost allocation
- Namespace-level budgets
- Chargeback to teams
- Idle resource detection
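Cost tools report spend, but a namespace-level budget is enforced with a ResourceQuota. A sketch for a hypothetical team-a namespace, with illustrative numbers:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a          # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"       # sum of CPU requests across all pods
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```

Once a quota covers CPU or memory, pods in that namespace must declare the corresponding requests and limits or they are rejected, which also backs up the checklist item on resource requests.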
Production Checklist
Before Launch:
- Multi-AZ control plane and workers
- RBAC configured with least privilege
- Network policies enforcing pod communication rules
- Secrets managed externally (Vault, AWS Secrets Manager)
- Resource requests/limits on all pods
- HPA configured for variable load
- Prometheus + Grafana monitoring
- Centralized logging with retention policies
- Automated backups with tested restore procedures
- Ingress with TLS termination
- Deployment pipelines with automated testing
- Runbooks for common incident scenarios
- Cost monitoring and budgets
Getting Started
Week 1-2: Infrastructure Setup
- Provision managed Kubernetes (EKS, GKE, AKS)
- Configure networking (VPC, subnets, security groups)
- Set up node groups with autoscaling
Week 3-4: Security and Observability
- Implement RBAC and network policies
- Deploy monitoring stack (Prometheus, Grafana)
- Configure centralized logging
Week 5-6: Application Deployment
- Deploy applications with proper resources
- Set up HPA and PDBs
- Configure ingress and TLS
Week 7-8: Hardening and Testing
- Backup and disaster recovery testing
- Load testing and capacity planning
- Security scanning and remediation
- Documentation and runbooks
Kubernetes in production requires expertise across multiple domains. Partner with experienced teams to accelerate time-to-production while avoiding costly mistakes.