How to Autoscale Kubernetes
Autoscaling in Kubernetes is a fundamental capability that enables applications to dynamically adjust their resource consumption based on real-time demand. As cloud-native architectures become the standard for modern applications, the ability to automatically scale compute resources (both pods and underlying nodes) ensures optimal performance, cost-efficiency, and resilience. Without autoscaling, teams face a choice between over-provisioning resources to handle peak loads, leading to unnecessary expense, and under-provisioning, resulting in degraded user experience and service outages.
Kubernetes autoscaling operates at two primary levels: the workload level (Horizontal Pod Autoscaler and Vertical Pod Autoscaler) and the infrastructure level (Cluster Autoscaler). Together, these components form a comprehensive autoscaling strategy that responds to metrics such as CPU utilization, memory pressure, custom application metrics, and external events like queue lengths or HTTP request rates.
This guide provides a complete, step-by-step tutorial on how to autoscale Kubernetes clusters effectively. Whether you're managing a small microservice deployment or a large-scale enterprise application, understanding and implementing autoscaling correctly will significantly improve your system's reliability and operational efficiency. By the end of this tutorial, you'll have the knowledge to configure, monitor, and optimize autoscaling policies tailored to your workloads' unique requirements.
Step-by-Step Guide
Prerequisites
Before configuring autoscaling, ensure your Kubernetes environment meets the following requirements:
- A running Kubernetes cluster (version 1.19 or higher recommended)
- kubectl installed and configured to communicate with your cluster
- Metrics Server deployed to collect resource usage data
- Appropriate RBAC permissions to create Horizontal Pod Autoscalers (HPA), Vertical Pod Autoscalers (VPA), and Cluster Autoscaler resources
- Cloud provider support (if using cloud-based Cluster Autoscaler) such as AWS, GCP, Azure, or DigitalOcean
To verify Metrics Server is running, execute:
kubectl get pods -n kube-system | grep metrics-server
If no output appears, deploy Metrics Server using:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Step 1: Configure Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a deployment, stateful set, or replica set based on observed CPU utilization or custom metrics.
First, deploy a sample application. For this example, we'll use a simple Nginx deployment:
kubectl create deployment nginx-app --image=nginx:latest
Expose the deployment as a service:
kubectl expose deployment nginx-app --port=80 --type=ClusterIP
Now, create an HPA that scales the deployment between 2 and 10 replicas, targeting 70% CPU utilization:
kubectl autoscale deployment nginx-app --cpu-percent=70 --min=2 --max=10
Alternatively, define the HPA using a YAML manifest for greater control:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 10
        periodSeconds: 15
Apply the manifest:
kubectl apply -f nginx-hpa.yaml
The behavior section fine-tunes scaling speed. Scaling up aggressively (100% per 15 seconds) allows rapid response to traffic spikes, while scaling down conservatively (10% per 15 seconds) prevents thrashing during temporary load dips.
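The replica count the controller converges on follows the algorithm documented for HPA. A minimal Python sketch of the core formula (ignoring readiness and missing-metric adjustments):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Desired replica count per the HPA algorithm:
    desired = ceil(current * currentMetricValue / targetMetricValue)."""
    ratio = current_metric / target_metric
    # The controller skips scaling when the ratio is within tolerance
    # of 1.0 (default 0.1), which damps oscillation around the target.
    if abs(ratio - 1.0) <= 0.1:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% CPU against a 70% target scale up to 6.
print(desired_replicas(4, 90, 70))
```

Because the result is rounded up, scale-up reacts quickly while small deviations inside the tolerance band cause no change at all.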
Step 2: Monitor HPA Status
Check the current status of your HPA:
kubectl get hpa
Output:
NAME        REFERENCE              TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx-app   45%/70%   2         10        2          5m
To view detailed events and metrics:
kubectl describe hpa nginx-hpa
Look for conditions such as AbleToScale and ScalingActive (with reasons like ValidMetricFound). If the HPA is not scaling, common causes include a missing Metrics Server, pods without resource requests, or misconfigured target metrics.
Step 3: Enable Custom Metrics with Prometheus
For advanced use cases, such as scaling based on HTTP request rate, queue depth, or database connection counts, use custom metrics via Prometheus and the Prometheus Adapter.
Install Prometheus using Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
Install the Prometheus Adapter from the same Helm repository:
helm install prometheus-adapter prometheus-community/prometheus-adapter --set "prometheus.url=http://prometheus-operated.default.svc.cluster.local" --set "prometheus.port=9090"
Adjust the URL if you installed the stack into a dedicated namespace.
Verify the adapter is exposing custom metrics:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
Now create an HPA that scales based on HTTP requests per second:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
This configuration scales the deployment when the average HTTP requests per second across all pods exceeds 100. Ensure your application exposes this metric via a sidecar or instrumentation library like Prometheus Client.
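For the http_requests_per_second metric to exist, the Prometheus Adapter must translate a raw counter into a per-second rate. A sketch of an adapter rule, assuming your application exports a counter named http_requests_total (adjust names to match your instrumentation); this is supplied via the adapter's rules config or Helm values:

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The rule discovers the counter, associates it with the pod and namespace that produced it, renames it, and computes a 2-minute rate that the HPA can query through the custom metrics API.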
Step 4: Implement Vertical Pod Autoscaler (VPA)
While HPA adjusts the number of pods, the Vertical Pod Autoscaler (VPA) adjusts the CPU and memory requests and limits of individual pods. This is particularly useful for applications with inconsistent or unpredictable resource usage patterns.
Deploy the VPA components from the kubernetes/autoscaler repository:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Wait for the VPA pods to be ready:
kubectl get pods -n kube-system | grep vpa
Create a VPA resource targeting your deployment:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"
Apply it:
kubectl apply -f nginx-vpa.yaml
VPA supports four update modes: Off (recommends only), Initial (applies recommendations only at pod creation), Recreate (evicts pods to apply changes), and Auto (currently equivalent to Recreate). Use Auto with caution in production; test in staging first.
Check recommendations:
kubectl get vpa nginx-vpa -o yaml
Look under status.recommendation.containerRecommendations for suggested CPU and memory values. VPA does not immediately change running pods; it updates them during the next restart or rollout.
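To keep VPA's adjustments within safe bounds, add a resourcePolicy. A sketch extending the VPA above with illustrative floors and ceilings (tune the values to your workload):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"       # applies to all containers in the pod
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "1"
        memory: 1Gi
```

Bounding the recommendations prevents a misbehaving workload from driving requests so high that pods can no longer schedule, or so low that they are starved.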
Step 5: Configure Cluster Autoscaler
Cluster Autoscaler (CA) automatically adjusts the number of nodes in your node pool based on pending pods and node utilization. It works in conjunction with HPA and VPA to ensure sufficient underlying infrastructure exists to support scaled workloads.
Cluster Autoscaler configuration varies by cloud provider. Below are examples for AWS EKS, GCP GKE, and Azure AKS.
AWS EKS
Install Cluster Autoscaler using Helm:
helm repo add eks https://aws.github.io/eks-charts
helm install cluster-autoscaler eks/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=your-eks-cluster-name \
--set awsRegion=us-east-1 \
--set rbac.create=true
Alternatively, use the YAML manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/your-eks-cluster-name
        env:
        - name: AWS_REGION
          value: us-east-1
        volumeMounts:
        - name: ssl-certs
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
      volumes:
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs/ca-bundle.crt
GCP GKE
Enable Cluster Autoscaler via the GCP Console or gcloud CLI:
gcloud container clusters update your-cluster-name \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10 \
--zone=us-central1-a
Azure AKS
az aks nodepool update \
--cluster-name your-aks-cluster \
--resource-group your-resource-group \
--name nodepool1 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
Once configured, Cluster Autoscaler monitors for pods in Pending state due to insufficient resources. When detected, it adds nodes from the configured node pool. When nodes are underutilized for a sustained period (default 10 minutes), it removes them.
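The scale-down decision can be pictured roughly as follows. This is a simplified Python sketch, not the actual algorithm: the real autoscaler also verifies that every pod on the node can be rescheduled elsewhere and that no PodDisruptionBudget would be violated, which this omits:

```python
def node_is_scale_down_candidate(requested_cpu_m, allocatable_cpu_m,
                                 requested_mem_mi, allocatable_mem_mi,
                                 underutilized_minutes,
                                 threshold=0.5, window_minutes=10):
    """Illustrative sketch of the Cluster Autoscaler scale-down check.

    A node becomes a removal candidate when the sum of pod requests,
    for both CPU and memory, stays below the utilization threshold
    (--scale-down-utilization-threshold, default 0.5) for the whole
    scale-down window (--scale-down-unneeded-time, default 10 minutes).
    """
    cpu_util = requested_cpu_m / allocatable_cpu_m
    mem_util = requested_mem_mi / allocatable_mem_mi
    return (max(cpu_util, mem_util) < threshold
            and underutilized_minutes >= window_minutes)

# A node with 400m CPU / 1000Mi memory requested out of 2000m / 4000Mi,
# underutilized for 12 minutes, is a candidate for removal.
print(node_is_scale_down_candidate(400, 2000, 1000, 4000, 12))
```

Note that the check is based on requests, not actual usage, which is another reason accurate resource requests matter for autoscaling.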
Step 6: Integrate with Pod Disruption Budgets (PDB)
To prevent service disruption during autoscaling events, especially during node draining, define a Pod Disruption Budget (PDB). A PDB ensures a minimum number of pods remain available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx-app
Apply it:
kubectl apply -f nginx-pdb.yaml
This ensures that even during scale-down or node maintenance, at least one instance of the nginx-app remains running, maintaining service continuity.
Best Practices
Set Appropriate Resource Requests and Limits
Autoscaling depends on accurate resource requests. If requests are too low, the scheduler may overcommit nodes, leading to resource contention. If too high, pods may never schedule, causing HPA to scale unnecessarily. Use tools like kubectl top pods and historical telemetry to set realistic values.
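For example, a container spec with explicit requests and limits (the values here are illustrative; derive yours from observed usage). Note that HPA computes CPU utilization as a percentage of the request, so the request value directly shapes scaling behavior:

```yaml
containers:
- name: nginx
  image: nginx:latest
  resources:
    requests:
      cpu: 250m        # HPA utilization percentages are relative to this
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
```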
Use Different Scaling Policies for Scale-Up and Scale-Down
Scale-up should be aggressive to handle sudden traffic spikes (e.g., 50 to 100% more replicas per minute). Scale-down should be conservative to avoid thrashing: rapid scale-up and scale-down cycles driven by transient load fluctuations. Use the behavior field in HPA to define separate policies.
Avoid Scaling Based on Memory Alone
Memory usage is often not a reliable autoscaling metric because it tends to grow over time due to caching and leaks. Prefer CPU or application-specific metrics like request latency or throughput. If using memory, pair it with a VPA to adjust limits over time.
Use Multiple Metrics for Stable Scaling
Combine multiple metrics (e.g., CPU plus HTTP requests) by listing several entries under the HPA's metrics field; the controller computes a desired replica count for each metric and uses the highest. This prevents false positives from a single metric anomaly.
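A sketch of a metrics section combining a CPU target with the custom request-rate metric from Step 3 (both go in one HPA; the higher of the two computed replica counts wins):

```yaml
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
```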
Test Autoscaling in Staging
Always validate autoscaling behavior in a non-production environment. Simulate traffic spikes using tools like k6, locust, or hey to observe scaling latency, node provisioning time, and pod startup delays.
Monitor Scaling Events and Alerts
Integrate HPA and Cluster Autoscaler events into your observability stack. Use Prometheus alerts for:
- HPA not scaling due to missing metrics
- Cluster Autoscaler unable to add nodes (e.g., quota limits)
- Pods pending for more than 5 minutes
Enable Node Affinity and Taints for Workload Isolation
Use node affinity rules to ensure critical workloads (e.g., databases) are scheduled on dedicated nodes not subject to autoscaling. Use taints and tolerations to prevent non-critical workloads from disrupting stable nodes.
Regularly Review and Update Autoscaling Policies
Application behavior changes over time. Re-evaluate HPA targets, VPA recommendations, and Cluster Autoscaler thresholds every 2 to 4 weeks. Use historical metrics to refine your thresholds.
Consider Cost Implications
Autoscaling can increase cloud costs if not managed carefully. Use spot instances for stateless workloads, implement scheduled scaling (e.g., scale down overnight), and consider tools such as OpenCost or Kubecost to track spending per deployment.
Tools and Resources
Core Kubernetes Components
- Metrics Server: collects resource usage data from kubelets
- Horizontal Pod Autoscaler (HPA): scales pod replicas based on metrics
- Vertical Pod Autoscaler (VPA): adjusts pod resource requests and limits
- Cluster Autoscaler: adds or removes nodes based on scheduling pressure
Third-Party Tools
- Prometheus + Prometheus Adapter: enables custom metric-based autoscaling
- Kubecost: monitors cost per namespace, deployment, and autoscaling event
- Datadog / New Relic / Grafana Cloud: advanced monitoring and alerting for autoscaling triggers
- Argo Rollouts: canary deployments with autoscaling integration
- Flux / Argo CD: GitOps tools to manage autoscaling configurations as code
Documentation and References
- Kubernetes HPA Documentation
- VPA GitHub Repository
- Cluster Autoscaler GitHub Repository
- Prometheus Query Language (PromQL) Guide
- LearnK8s Autoscaling Guide
Sample Scripts and Templates
Use these templates as starting points:
- HPA with Custom Metric: scale based on a Prometheus query
- VPA with Recommendations Only: test before enabling auto-updates
- Cluster Autoscaler for Multi-AZ: ensures high availability during node provisioning
- CI/CD Integration: auto-deploy HPA changes via GitOps
Real Examples
Example 1: E-commerce Site During Black Friday
A retail company runs a Kubernetes cluster on AWS EKS hosting a microservice architecture for their online store. During Black Friday, traffic increases 10x from baseline.
- HPA configured to scale the product catalog service from 4 to 50 replicas based on CPU and HTTP request rate (via Prometheus).
- Cluster Autoscaler adds 15 additional m5.large nodes from a spot instance pool to accommodate the surge.
- VPA increases memory requests for the cart service from 256Mi to 512Mi as session data grows.
- PDB ensures at least 80% of product catalog pods remain available during node drain.
Result: The site handles 500K concurrent users with 99.98% uptime. Post-event, autoscaling reduces nodes to baseline, saving 65% in cloud costs.
Example 2: Real-Time Analytics Platform
A SaaS company processes real-time log data using a Kafka-based ingestion pipeline deployed on GKE.
- HPA scales consumer pods based on Kafka lag (custom metric via Prometheus Adapter).
- When lag exceeds 10,000 messages, HPA scales up by 5 pods every 2 minutes.
- Cluster Autoscaler adds n1-standard-4 nodes when pending pods exceed 5.
- VPA adjusts memory limits dynamically as data payloads vary by hour.
Result: Processing latency remains under 2 seconds during peak ingestion. Without autoscaling, latency would have exceeded 15 minutes.
Example 3: Internal Dev Tools with Scheduled Scaling
A startup runs internal CI/CD tools (Jenkins, SonarQube) on a small AKS cluster. Usage is high during business hours and near-zero overnight.
- HPA scales Jenkins agents from 1 to 10 based on queue length.
- Cluster Autoscaler enabled with min=2, max=8.
- External Scheduler uses a cron job to scale down node pool to 1 node at 7 PM and scale up to 5 at 8 AM.
Result: Monthly cloud costs reduced by 40% without impacting developer productivity.
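One common way to implement the scheduled piece is a Kubernetes CronJob that patches autoscaling settings on a timer. A sketch of the evening job (the HPA name, ServiceAccount, and image are illustrative; the ServiceAccount needs RBAC permission to patch HPAs):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-evening
spec:
  schedule: "0 19 * * 1-5"          # 7 PM on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # must be allowed to patch HPAs
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - patch
            - hpa
            - jenkins-agents
            - --patch
            - '{"spec": {"minReplicas": 1}}'
```

A mirror-image job at 8 AM raises minReplicas back up; keeping both jobs in Git alongside the HPA manifests makes the schedule auditable.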
FAQs
What's the difference between HPA and VPA?
HPA scales the number of pod replicas horizontally, adding or removing instances. VPA adjusts the CPU and memory resources allocated to each individual pod vertically, increasing or decreasing the request and limit values.
Can I use HPA and VPA together?
Yes, but with caution. HPA and VPA can conflict if VPA changes resource requests while HPA is scaling. Use VPA in Initial or Off mode in production, or use VPA only for long-term trend adjustments and HPA for real-time scaling.
Why isn't my HPA scaling?
Common reasons include:
- Metrics Server not running or unreachable
- Pods lack resource requests
- Target metric is unreachable (e.g., custom Prometheus metric not exposed)
- HPA is in a failed condition state (check kubectl describe hpa)
- Pods are in CrashLoopBackOff or Pending state
How long does Cluster Autoscaler take to add a node?
Typically 1 to 5 minutes, depending on cloud provider and node image provisioning time. Spot instances may take longer due to availability constraints.
Does autoscaling work with StatefulSets?
Yes, HPA supports StatefulSets. However, VPA has limited support for StatefulSets due to the complexity of preserving stateful data during resource changes. Use HPA with StatefulSets for replica scaling.
Can I autoscale based on external events like GitHub commits or Slack messages?
Yes, using custom metrics. For example, a webhook can push commit count to Prometheus, and HPA can scale based on that metric. Tools like KEDA (Kubernetes Event-Driven Autoscaling) automate this process.
What is KEDA?
KEDA (Kubernetes Event-Driven Autoscaling) is a lightweight, open-source component that enables event-driven autoscaling for any Kubernetes workload. It supports over 40 event sources including Kafka, RabbitMQ, Azure Queues, GitHub, and more. KEDA can replace or enhance HPA for complex, event-based scaling scenarios.
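As a sketch, a KEDA ScaledObject that scales Kafka consumers on lag, matching the analytics scenario in Example 2 (broker address, consumer group, topic, and Deployment name are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: log-consumer-scaler
spec:
  scaleTargetRef:
    name: log-consumer            # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.default.svc:9092
      consumerGroup: log-processors
      topic: raw-logs
      lagThreshold: "10000"
```

KEDA manages an HPA under the hood, so you get the same scaling machinery without wiring up a Prometheus Adapter for each event source.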
Is autoscaling expensive?
It can be, if misconfigured. Overly aggressive scale-up or slow scale-down increases costs. Use cost monitoring tools, set max replicas, and combine with scheduled scaling or spot instances to optimize spend.
Should I use autoscaling for stateful applications like databases?
Generally, no. Databases like PostgreSQL or MongoDB are not designed for horizontal scaling. Use vertical scaling (VPA) cautiously, and prefer managed database services with built-in scaling. Avoid autoscaling databases unless you're using a distributed system like Vitess or CockroachDB.
How do I rollback a bad autoscaling configuration?
Use GitOps tools like Argo CD or Flux to version-control your HPA, VPA, and Cluster Autoscaler manifests. If an update causes issues, revert the Git commit and let the operator restore the previous configuration.
Conclusion
Autoscaling Kubernetes is not a single feature; it's a coordinated strategy that combines Horizontal Pod Autoscaling, Vertical Pod Autoscaling, Cluster Autoscaling, and custom metrics to create a self-optimizing infrastructure. When implemented correctly, it delivers resilience against traffic surges, reduces operational overhead, and lowers cloud costs by aligning resource allocation with actual demand.
This guide provided a comprehensive, practical walkthrough, from deploying Metrics Server and configuring HPA to integrating with Prometheus and Cluster Autoscaler. Real-world examples demonstrated how enterprises leverage autoscaling to handle everything from Black Friday traffic to real-time data pipelines.
Remember: autoscaling thrives on accurate metrics, thoughtful thresholds, and disciplined monitoring. Avoid the trap of "set it and forget it." Regularly review scaling behavior, validate against performance benchmarks, and refine policies as your applications evolve.
By mastering these techniques, you transform Kubernetes from a static orchestration platform into a dynamic, intelligent system that adapts to your workload's needs, ensuring optimal performance, availability, and efficiency at every scale.