How to Autoscale Kubernetes
Autoscaling in Kubernetes is a fundamental capability that enables applications to dynamically adjust their resource consumption based on real-time demand. As cloud-native architectures become the standard for modern applications, the ability to automatically scale compute resources (both pods and underlying nodes) ensures optimal performance, cost-efficiency, and resilience. Without autoscaling, teams face a choice between over-provisioning resources to handle peak loads, leading to unnecessary expense, and under-provisioning, resulting in degraded user experience and service outages.
Kubernetes autoscaling operates at two primary levels: the workload level (Horizontal Pod Autoscaler and Vertical Pod Autoscaler) and the infrastructure level (Cluster Autoscaler). Together, these components form a comprehensive autoscaling strategy that responds to metrics such as CPU utilization, memory pressure, custom application metrics, and external events like queue lengths or HTTP request rates.
This guide provides a complete, step-by-step tutorial on how to autoscale Kubernetes clusters effectively. Whether you're managing a small microservice deployment or a large-scale enterprise application, understanding and implementing autoscaling correctly will significantly improve your system's reliability and operational efficiency. By the end of this tutorial, you'll have the knowledge to configure, monitor, and optimize autoscaling policies tailored to your workloads' unique requirements.
Step-by-Step Guide
Prerequisites
Before configuring autoscaling, ensure your Kubernetes environment meets the following requirements:
- A running Kubernetes cluster (version 1.19 or higher recommended)
- kubectl installed and configured to communicate with your cluster
- Metrics Server deployed to collect resource usage data
- Appropriate RBAC permissions to create Horizontal Pod Autoscalers (HPA), Vertical Pod Autoscalers (VPA), and Cluster Autoscaler resources
- Cloud provider support (if using cloud-based Cluster Autoscaler) such as AWS, GCP, Azure, or DigitalOcean
To verify Metrics Server is running, execute:
kubectl get pods -n kube-system | grep metrics-server
If no output appears, deploy Metrics Server using:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Step 1: Configure Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a deployment, stateful set, or replica set based on observed CPU utilization or custom metrics.
First, deploy a sample application. For this example, we'll use a simple Nginx deployment:
kubectl create deployment nginx-app --image=nginx:latest
Expose the deployment as a service:
kubectl expose deployment nginx-app --port=80 --type=ClusterIP
Now, create an HPA that scales the deployment between 2 and 10 replicas, targeting 70% CPU utilization:
kubectl autoscale deployment nginx-app --cpu-percent=70 --min=2 --max=10
Alternatively, define the HPA using a YAML manifest for greater control:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 10
        periodSeconds: 15
Apply the manifest:
kubectl apply -f nginx-hpa.yaml
The behavior section fine-tunes scaling speed. Scaling up aggressively (100% per 15 seconds) allows rapid response to traffic spikes, while scaling down conservatively (10% per 15 seconds) prevents thrashing during temporary load dips.
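The replica count the controller converges on follows the algorithm documented for HPA. A minimal Python sketch of the core formula (ignoring readiness and missing-metric adjustments):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Desired replica count per the HPA algorithm:
    desired = ceil(current * currentMetricValue / targetMetricValue)."""
    ratio = current_metric / target_metric
    # The controller skips scaling when the ratio is within tolerance
    # of 1.0 (default 0.1), which damps oscillation around the target.
    if abs(ratio - 1.0) <= 0.1:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% CPU against a 70% target scale up to 6.
print(desired_replicas(4, 90, 70))
```

Because the result is rounded up, scale-up reacts quickly while small deviations inside the tolerance band cause no change at all.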
Step 2: Monitor HPA Status
Check the current status of your HPA:
kubectl get hpa
Output:
NAME        REFERENCE              TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx-app   45%/70%   2         10        2          5m
To view detailed events and metrics:
kubectl describe hpa nginx-hpa
Look for conditions such as AbleToScale and ScalingActive (with reasons like ValidMetricFound). If the HPA is not scaling, common causes include a missing Metrics Server, pods without resource requests, or misconfigured target metrics.
Step 3: Enable Custom Metrics with Prometheus
For advanced use cases, such as scaling based on HTTP request rate, queue depth, or database connection counts, use custom metrics via Prometheus and the Prometheus Adapter.
Install Prometheus using Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
Install the Prometheus Adapter from the same Helm repository:
helm install prometheus-adapter prometheus-community/prometheus-adapter --set "prometheus.url=http://prometheus-operated.default.svc.cluster.local" --set "prometheus.port=9090"
Adjust the URL if you installed the stack into a dedicated namespace.
Verify the adapter is exposing custom metrics:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
Now create an HPA that scales based on HTTP requests per second:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
This configuration scales the deployment when the average HTTP requests per second across all pods exceeds 100. Ensure your application exposes this metric via a sidecar or instrumentation library like Prometheus Client.
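For the http_requests_per_second metric to exist, the Prometheus Adapter must translate a raw counter into a per-second rate. A sketch of an adapter rule, assuming your application exports a counter named http_requests_total (adjust names to match your instrumentation); this is supplied via the adapter's rules config or Helm values:

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The rule discovers the counter, associates it with the pod and namespace that produced it, renames it, and computes a 2-minute rate that the HPA can query through the custom metrics API.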
Step 4: Implement Vertical Pod Autoscaler (VPA)
While HPA adjusts the number of pods, the Vertical Pod Autoscaler (VPA) adjusts the CPU and memory requests and limits of individual pods. This is particularly useful for applications with inconsistent or unpredictable resource usage patterns.
Deploy the VPA components from the kubernetes/autoscaler repository:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Wait for the VPA pods to be ready:
kubectl get pods -n kube-system | grep vpa
Create a VPA resource targeting your deployment:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"
Apply it:
kubectl apply -f nginx-vpa.yaml
VPA supports four update modes: Off (recommends only), Initial (applies recommendations only at pod creation), Recreate (evicts pods to apply changes), and Auto (currently equivalent to Recreate). Use Auto with caution in production; test in staging first.
Check recommendations:
kubectl get vpa nginx-vpa -o yaml
Look under status.recommendation.containerRecommendations for suggested CPU and memory values. VPA does not immediately change running pods; it updates them during the next restart or rollout.
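To keep VPA's adjustments within safe bounds, add a resourcePolicy. A sketch extending the VPA above with illustrative floors and ceilings (tune the values to your workload):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"       # applies to all containers in the pod
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "1"
        memory: 1Gi
```

Bounding the recommendations prevents a misbehaving workload from driving requests so high that pods can no longer schedule, or so low that they are starved.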
Step 5: Configure Cluster Autoscaler
Cluster Autoscaler (CA) automatically adjusts the number of nodes in your node pool based on pending pods and node utilization. It works in conjunction with HPA and VPA to ensure sufficient underlying infrastructure exists to support scaled workloads.
Cluster Autoscaler configuration varies by cloud provider. Below are examples for AWS EKS, GCP GKE, and Azure AKS.
AWS EKS
Install Cluster Autoscaler using Helm:
helm repo add eks https://aws.github.io/eks-charts
helm install cluster-autoscaler eks/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=your-eks-cluster-name \
--set awsRegion=us-east-1 \
--set rbac.create=true
Alternatively, use the YAML manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/your-eks-cluster-name
        env:
        - name: AWS_REGION
          value: us-east-1
        volumeMounts:
        - name: ssl-certs
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
      volumes:
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs/ca-bundle.crt
GCP GKE
Enable Cluster Autoscaler via the GCP Console or gcloud CLI:
gcloud container clusters update your-cluster-name \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10 \
--zone=us-central1-a
Azure AKS
az aks nodepool update \
--cluster-name your-aks-cluster \
--resource-group your-resource-group \
--name nodepool1 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
Once configured, Cluster Autoscaler monitors for pods in Pending state due to insufficient resources. When detected, it adds nodes from the configured node pool. When nodes are underutilized for a sustained period (default 10 minutes), it removes them.
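The scale-down decision can be pictured roughly as follows. This is a simplified Python sketch, not the actual algorithm: the real autoscaler also verifies that every pod on the node can be rescheduled elsewhere and that no PodDisruptionBudget would be violated, which this omits:

```python
def node_is_scale_down_candidate(requested_cpu_m, allocatable_cpu_m,
                                 requested_mem_mi, allocatable_mem_mi,
                                 underutilized_minutes,
                                 threshold=0.5, window_minutes=10):
    """Illustrative sketch of the Cluster Autoscaler scale-down check.

    A node becomes a removal candidate when the sum of pod requests,
    for both CPU and memory, stays below the utilization threshold
    (--scale-down-utilization-threshold, default 0.5) for the whole
    scale-down window (--scale-down-unneeded-time, default 10 minutes).
    """
    cpu_util = requested_cpu_m / allocatable_cpu_m
    mem_util = requested_mem_mi / allocatable_mem_mi
    return (max(cpu_util, mem_util) < threshold
            and underutilized_minutes >= window_minutes)

# A node with 400m CPU / 1000Mi memory requested out of 2000m / 4000Mi,
# underutilized for 12 minutes, is a candidate for removal.
print(node_is_scale_down_candidate(400, 2000, 1000, 4000, 12))
```

Note that the check is based on requests, not actual usage, which is another reason accurate resource requests matter for autoscaling.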
Step 6: Integrate with Pod Disruption Budgets (PDB)
To prevent service disruption during autoscaling events, especially during node draining, define a Pod Disruption Budget (PDB). A PDB ensures a minimum number of pods remain available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx-app
Apply it:
kubectl apply -f nginx-pdb.yaml
This ensures that even during scale-down or node maintenance, at least one instance of the nginx-app remains running, maintaining service continuity.
Best Practices
Set Appropriate Resource Requests and Limits
Autoscaling depends on accurate resource requests. If requests are too low, the scheduler may overcommit nodes, leading to resource contention. If too high, pods may never schedule, causing HPA to scale unnecessarily. Use tools like kubectl top pods and historical telemetry to set realistic values.
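For example, a container spec with explicit requests and limits (the values here are illustrative; derive yours from observed usage). Note that HPA computes CPU utilization as a percentage of the request, so the request value directly shapes scaling behavior:

```yaml
containers:
- name: nginx
  image: nginx:latest
  resources:
    requests:
      cpu: 250m        # HPA utilization percentages are relative to this
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
```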
Use Different Scaling Policies for Scale-Up and Scale-Down
Scale-up should be aggressive to handle sudden traffic spikes (e.g., 50 to 100% more replicas per minute). Scale-down should be conservative to avoid thrashing: rapid scale-up and scale-down cycles driven by transient load fluctuations. Use the behavior field in HPA to define separate policies.
Avoid Scaling Based on Memory Alone
Memory usage is often not a reliable autoscaling metric because it tends to grow over time due to caching and leaks. Prefer CPU or application-specific metrics like request latency or throughput. If using memory, pair it with a VPA to adjust limits over time.
Use Multiple Metrics for Stable Scaling
Combine multiple metrics (e.g., CPU plus HTTP requests) by listing several entries under the HPA's metrics field; the controller computes a desired replica count for each metric and uses the highest. This prevents false positives from a single metric anomaly.
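A sketch of a metrics section combining a CPU target with the custom request-rate metric from Step 3 (both go in one HPA; the higher of the two computed replica counts wins):

```yaml
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
```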
Test Autoscaling in Staging
Always validate autoscaling behavior in a non-production environment. Simulate traffic spikes using tools like k6, locust, or hey to observe scaling latency, node provisioning time, and pod startup delays.
Monitor Scaling Events and Alerts
Integrate HPA and Cluster Autoscaler events into your observability stack. Use Prometheus alerts for:
- HPA not scaling due to missing metrics
- Cluster Autoscaler unable to add nodes (e.g., quota limits)
- Pods pending for more than 5 minutes
Enable Node Affinity and Taints for Workload Isolation
Use node affinity rules to ensure critical workloads (e.g., databases) are scheduled on dedicated nodes not subject to autoscaling. Use taints and tolerations to prevent non-critical workloads from disrupting stable nodes.
Regularly Review and Update Autoscaling Policies
Application behavior changes over time. Re-evaluate HPA targets, VPA recommendations, and Cluster Autoscaler thresholds every 2 to 4 weeks. Use historical metrics to refine your thresholds.
Consider Cost Implications
Autoscaling can increase cloud costs if not managed carefully. Use spot instances for stateless workloads, implement scheduled scaling (e.g., scale down overnight), and consider tools such as OpenCost or Kubecost to track spending per deployment.
Tools and Resources
Core Kubernetes Components
- Metrics Server: collects resource usage data from kubelets
- Horizontal Pod Autoscaler (HPA): scales pod replicas based on metrics
- Vertical Pod Autoscaler (VPA): adjusts pod resource requests and limits
- Cluster Autoscaler: adds or removes nodes based on scheduling pressure
Third-Party Tools
- Prometheus + Prometheus Adapter: enables custom metric-based autoscaling
- Kubecost: monitors cost per namespace, deployment, and autoscaling event
- Datadog / New Relic / Grafana Cloud: advanced monitoring and alerting for autoscaling triggers
- Argo Rollouts: canary deployments with autoscaling integration
- Flux / Argo CD: GitOps tools to manage autoscaling configurations as code
Documentation and References
- Kubernetes HPA Documentation
- VPA GitHub Repository
- Cluster Autoscaler GitHub Repository
- Prometheus Query Language (PromQL) Guide
- LearnK8s Autoscaling Guide
Sample Scripts and Templates
Use these templates as starting points:
- HPA with Custom Metric: scale based on a Prometheus query
- VPA with Recommendations Only: test before enabling auto-updates
- Cluster Autoscaler for Multi-AZ: ensures high availability during node provisioning
- CI/CD Integration: auto-deploy HPA changes via GitOps
Real Examples
Example 1: E-commerce Site During Black Friday
A retail company runs a Kubernetes cluster on AWS EKS hosting a microservice architecture for their online store. During Black Friday, traffic increases 10x from baseline.
- HPA configured to scale the product catalog service from 4 to 50 replicas based on CPU and HTTP request rate (via Prometheus).
- Cluster Autoscaler adds 15 additional m5.large nodes from a spot instance pool to accommodate the surge.
- VPA increases memory requests for the cart service from 256Mi to 512Mi as session data grows.
- PDB ensures at least 80% of product catalog pods remain available during node drain.
Result: The site handles 500K concurrent users with 99.98% uptime. Post-event, autoscaling reduces nodes to baseline, saving 65% in cloud costs.
Example 2: Real-Time Analytics Platform
A SaaS company processes real-time log data using a Kafka-based ingestion pipeline deployed on GKE.
- HPA scales consumer pods based on Kafka lag (custom metric via Prometheus Adapter).
- When lag exceeds 10,000 messages, HPA scales up by 5 pods every 2 minutes.
- Cluster Autoscaler adds n1-standard-4 nodes when pending pods exceed 5.
- VPA adjusts memory limits dynamically as data payloads vary by hour.
Result: Processing latency remains under 2 seconds during peak ingestion. Without autoscaling, latency would have exceeded 15 minutes.
Example 3: Internal Dev Tools with Scheduled Scaling
A startup runs internal CI/CD tools (Jenkins, SonarQube) on a small AKS cluster. Usage is high during business hours and near-zero overnight.
- HPA scales Jenkins agents from 1 to 10 based on queue length.
- Cluster Autoscaler enabled with min=2, max=8.
- External Scheduler uses a cron job to scale down node pool to 1 node at 7 PM and scale up to 5 at 8 AM.
Result: Monthly cloud costs reduced by 40% without impacting developer productivity.
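One common way to implement the scheduled piece is a Kubernetes CronJob that patches autoscaling settings on a timer. A sketch of the evening job (the HPA name, ServiceAccount, and image are illustrative; the ServiceAccount needs RBAC permission to patch HPAs):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-evening
spec:
  schedule: "0 19 * * 1-5"          # 7 PM on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # must be allowed to patch HPAs
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - patch
            - hpa
            - jenkins-agents
            - --patch
            - '{"spec": {"minReplicas": 1}}'
```

A mirror-image job at 8 AM raises minReplicas back up; keeping both jobs in Git alongside the HPA manifests makes the schedule auditable.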
FAQs
What's the difference between HPA and VPA?
HPA scales the number of pod replicas horizontally, adding or removing instances. VPA adjusts the CPU and memory resources allocated to each individual pod vertically, increasing or decreasing the request and limit values.
Can I use HPA and VPA together?
Yes, but with caution. HPA and VPA can conflict if VPA changes resource requests while HPA is scaling. Use VPA in Initial or Off mode in production, or use VPA only for long-term trend adjustments and HPA for real-time scaling.
Why isn't my HPA scaling?
Common reasons include:
- Metrics Server not running or unreachable
- Pods lack resource requests
- Target metric is unreachable (e.g., custom Prometheus metric not exposed)
- HPA is in a failed condition state (check kubectl describe hpa)
- Pods are in CrashLoopBackOff or Pending state
How long does Cluster Autoscaler take to add a node?
Typically 1 to 5 minutes, depending on cloud provider and node image provisioning time. Spot instances may take longer due to availability constraints.
Does autoscaling work with StatefulSets?
Yes, HPA supports StatefulSets. However, VPA has limited support for StatefulSets due to the complexity of preserving stateful data during resource changes. Use HPA with StatefulSets for replica scaling.
Can I autoscale based on external events like GitHub commits or Slack messages?
Yes, using custom metrics. For example, a webhook can push commit count to Prometheus, and HPA can scale based on that metric. Tools like KEDA (Kubernetes Event-Driven Autoscaling) automate this process.
What is KEDA?
KEDA (Kubernetes Event-Driven Autoscaling) is a lightweight, open-source component that enables event-driven autoscaling for any Kubernetes workload. It supports over 40 event sources including Kafka, RabbitMQ, Azure Queues, GitHub, and more. KEDA can replace or enhance HPA for complex, event-based scaling scenarios.
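As a sketch, a KEDA ScaledObject that scales Kafka consumers on lag, matching the analytics scenario in Example 2 (broker address, consumer group, topic, and Deployment name are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: log-consumer-scaler
spec:
  scaleTargetRef:
    name: log-consumer            # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.default.svc:9092
      consumerGroup: log-processors
      topic: raw-logs
      lagThreshold: "10000"
```

KEDA manages an HPA under the hood, so you get the same scaling machinery without wiring up a Prometheus Adapter for each event source.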
Is autoscaling expensive?
It can be, if misconfigured. Overly aggressive scale-up or slow scale-down increases costs. Use cost monitoring tools, set max replicas, and combine with scheduled scaling or spot instances to optimize spend.
Should I use autoscaling for stateful applications like databases?
Generally, no. Databases like PostgreSQL or MongoDB are not designed for horizontal scaling. Use vertical scaling (VPA) cautiously, and prefer managed database services with built-in scaling. Avoid autoscaling databases unless you're using a distributed system like Vitess or CockroachDB.
How do I rollback a bad autoscaling configuration?
Use GitOps tools like Argo CD or Flux to version-control your HPA, VPA, and Cluster Autoscaler manifests. If an update causes issues, revert the Git commit and let the operator restore the previous configuration.
Conclusion
Autoscaling Kubernetes is not a single feature; it's a coordinated strategy that combines Horizontal Pod Autoscaling, Vertical Pod Autoscaling, Cluster Autoscaling, and custom metrics to create a self-optimizing infrastructure. When implemented correctly, it delivers resilience against traffic surges, reduces operational overhead, and lowers cloud costs by aligning resource allocation with actual demand.
This guide provided a comprehensive, practical walkthrough, from deploying Metrics Server and configuring HPA to integrating with Prometheus and Cluster Autoscaler. Real-world examples demonstrated how enterprises leverage autoscaling to handle everything from Black Friday traffic to real-time data pipelines.
Remember: autoscaling thrives on accurate metrics, thoughtful thresholds, and disciplined monitoring. Avoid the trap of "set it and forget it." Regularly review scaling behavior, validate against performance benchmarks, and refine policies as your applications evolve.
By mastering these techniques, you transform Kubernetes from a static orchestration platform into a dynamic, intelligent system that adapts to your workload's needs, ensuring optimal performance, availability, and efficiency at every scale.