DevOps • April 5, 2026
Kubernetes in Production: Lessons from Scaling Microservices
Battle-tested patterns for running Kubernetes at scale — from resource tuning and pod autoscaling to zero-downtime deployments and observability pipelines.
Kubernetes has become the de facto orchestration layer for production microservices, but running it reliably at scale requires far more than writing a few YAML manifests. After operating Kubernetes clusters serving millions of requests daily, we've distilled the patterns that separate hobby deployments from production-grade infrastructure.
🏗️ Cluster Architecture
A production Kubernetes setup typically involves:
- Control Plane High Availability: Running at least 3 control plane nodes across availability zones with etcd replication to prevent single points of failure.
- Node Pools: Separate node groups for different workload types (compute-optimized for API servers, memory-optimized for caching layers, and GPU nodes for ML inference).
- Namespace Isolation: Using namespaces with `ResourceQuota` and `LimitRange` objects to prevent noisy-neighbor effects between teams.
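A minimal sketch of the namespace-isolation item pairs a `ResourceQuota` with a `LimitRange`. It assumes a team namespace called `team-a` already exists. That name, the object names, and every quantity below are hypothetical placeholders; size them from your own capacity planning.

```yaml
# Caps the total compute that all pods in the team's namespace can request and consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-compute
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
---
# Per-container defaults and a ceiling so no single container can absorb the quota.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:   # default request values
        cpu: 250m
        memory: 256Mi
      default:          # default limit values
        cpu: 500m
        memory: 512Mi
      max:
        cpu: "2"
        memory: 2Gi
```

The two objects work as a pair. Once a quota constrains `requests.cpu` or `requests.memory`, pods that don't declare those values are rejected at admission. The `LimitRange` defaults fill in the values for containers that omit them.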
⚡ Pod Autoscaling Done Right
The Horizontal Pod Autoscaler (HPA) is powerful but needs careful tuning. Default CPU-based scaling often reacts too slowly for bursty traffic. We recommend:
- Using custom metrics from Prometheus (e.g., requests-per-second, queue depth) via the prometheus-adapter.
- Setting `stabilizationWindowSeconds` under the HPA's `spec.behavior` to prevent flapping during traffic spikes.
- Combining HPA with Vertical Pod Autoscaler (VPA) in recommendation mode to right-size resource requests over time.
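For the VPA item, a minimal sketch of recommendation mode might look like the manifest below. It assumes the VPA components and CRDs from the Kubernetes autoscaler project are installed, since VPA does not ship with core Kubernetes. The object name `api-server-vpa` is hypothetical. With `updateMode: "Off"`, the VPA computes recommendations but never evicts or resizes pods, so replica management stays with the HPA.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    # "Off" = recommendations only; pods are never evicted or mutated.
    # Quoted on purpose: an unquoted Off can be parsed as a YAML boolean.
    updateMode: "Off"
```

`kubectl describe vpa api-server-vpa` then shows the suggested requests, which you can fold into the Deployment by hand. The HPA manifest that follows handles the replica count separately.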
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
```
🔄 Zero-Downtime Deployments
Rolling deployments are the default, but achieving true zero downtime requires attention to detail:
- Readiness Probes: Ensure new pods only receive traffic after they've fully initialized, including database connection pool warm-up and cache priming.
- PodDisruptionBudgets: Guarantee a minimum number of available replicas during voluntary disruptions like node drains.
- Graceful Shutdown: Handle `SIGTERM` signals in your application to finish in-flight requests before terminating. Set `terminationGracePeriodSeconds` (30 seconds by default) long enough to cover your slowest in-flight request, because the container is sent `SIGKILL` once it expires.
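The three practices above can be combined in one sketch: a Deployment whose rollout is gated on a readiness probe and whose pods get time to drain, plus a PodDisruptionBudget for node drains. The image, the `/readyz` path, port 8080, and all timings are hypothetical. The `preStop` hook also assumes the image contains a `sleep` binary.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3   # initial count; drop this field once an HPA manages replicas
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # add one new pod at a time
      maxUnavailable: 0   # never drop below the desired number of available pods
  template:
    metadata:
      labels:
        app: api-server
    spec:
      # Must cover the preStop pause plus the app's own SIGTERM drain time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz   # hypothetical endpoint that reports ready only after warm-up
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so load balancers have time to stop routing to this pod.
                command: ["sleep", "10"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2   # with 3 replicas, a drain can evict only one at a time
  selector:
    matchLabels:
      app: api-server
```

The grace-period countdown starts before the `preStop` hook runs. For that reason, `terminationGracePeriodSeconds` has to cover the hook plus the application's own drain time. With `maxUnavailable: 0`, an old pod is only removed after its replacement has passed the readiness probe.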
📊 Observability Stack
You can't manage what you can't measure. Our recommended observability stack:
- Metrics: Prometheus + Grafana for cluster and application metrics with pre-built dashboards.
- Logging: Loki or the EFK stack (Elasticsearch, Fluentd, Kibana) for centralized log aggregation.
- Tracing: OpenTelemetry for instrumentation and trace collection, with Jaeger as the backend for storing and querying distributed traces across microservices.
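To make the tracing split concrete, here is a minimal OpenTelemetry Collector configuration. It receives OTLP spans from instrumented services, batches them, and forwards them over OTLP to Jaeger, which recent Jaeger releases accept directly. The address `jaeger-collector.observability.svc` is a hypothetical in-cluster Service name. `insecure: true` disables TLS and is only reasonable for a trusted cluster-internal hop.

```yaml
# OpenTelemetry Collector pipeline: OTLP in, batched, OTLP out to Jaeger.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}   # groups spans to reduce the number of export calls

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc:4317   # hypothetical Service address
    tls:
      insecure: true   # plaintext export; trusted in-cluster traffic only

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```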
🛡️ Security Hardening
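Production clusters must enforce:
- Network Policies: Restrict pod-to-pod communication with `NetworkPolicy` objects, which only take effect when the CNI plugin enforces them (e.g., Calico or Cilium).
- RBAC: Limit `kubectl` access using fine-grained Roles, ClusterRoles, and ServiceAccounts; never give developers `cluster-admin`.
- Image Scanning: Integrate tools like Trivy into your CI pipeline to catch CVEs before deployment.
- Pod Security Standards: Enforce the `restricted` standard, which requires non-root containers, using the built-in Pod Security Admission controller. The standard does not require read-only root filesystems, so also set `readOnlyRootFilesystem: true` in each container's `securityContext`.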
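A minimal sketch tying the network and pod-security items together for a hypothetical `team-a` namespace. It contains four pieces:
- Pod Security Admission labels that enforce `restricted`.
- A default-deny ingress policy.
- An exception letting a hypothetical `web-frontend` reach `api-server`.
- A Pod that satisfies `restricted` while running with a read-only root filesystem.

Names, labels, the port, and the image are placeholders.

```yaml
# Namespace-level Pod Security Admission: reject pods that violate "restricted".
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
---
# Default deny: the empty podSelector matches every pod, and no ingress rules are listed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Exception: only frontend pods may reach api-server, and only on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web-frontend   # hypothetical client label
      ports:
        - protocol: TCP
          port: 8080
---
# A pod that passes "restricted" and also runs with a read-only root filesystem.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example
  namespace: team-a
spec:
  securityContext:
    runAsNonRoot: true   # image must use a numeric non-root USER (or set runAsUser)
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/api-server:1.0.0   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true   # extra hardening beyond "restricted"
      volumeMounts:
        - name: tmp
          mountPath: /tmp   # writable scratch space despite the read-only root
  volumes:
    - name: tmp
      emptyDir: {}
```

Before labeling an existing namespace, you can run `kubectl label --dry-run=server --overwrite ns team-a pod-security.kubernetes.io/enforce=restricted`. It reports which running pods would violate the standard without changing anything.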
