Horizontal vs Vertical Scaling for Kubernetes Workloads
Tim Nichols
CEO/Founder
tl;dr: we're sharing our internal logic for picking the right autoscaler (HPA, KEDA, VPA) for your Kubernetes workload.
Kubernetes autoscaling can slow down engineers
The Kubernetes ecosystem offers many autoscaling tools so that every workload on a cluster can scale up and down with traffic. This means your workloads won't OOM and you won't spend all your time updating resource requests and limits. The problem is that it's a lot of work to choose, optimize, and maintain the correct autoscaler for everything running on your cluster(s).
Engineers need to make the correct decision for each workload, including:
- Strategy: Horizontal vs Vertical
- Tooling: HPA vs KEDA vs VPA
- Standard Metrics (ex: Memory Utilization) vs Custom Metrics (ex: Queue Size)
- Scaling thresholds, behavior, etc.
Making these decisions requires Kubernetes expertise and familiarity with the workload’s structure, traffic profile and dependencies.
Our Guide to Picking a Kubernetes Autoscaler
Flightcrew is a config copilot that helps engineers bulletproof and scale their cloud infrastructure.
We process a firehose of data so that we can tell engineers how their configs affect production environments, and when they need to make an update. This means we've got a great dataset on how to match workloads to the correct autoscaler.
We shared some of our aggregated data + recommendation logic with our friends at Zendesk and DoorDash, and have put together a guide to help you make these choices.
When is Autoscaling a bad idea?
Rarely!
Many teams hesitate on autoscaling because they're still getting used to Kubernetes and they worry that autoscaling will negatively impact availability and performance. We're sympathetic; we're building Flightcrew because it can take months to master production workloads on Kubernetes.
That said, autoscaling should work for the majority of architectures and workloads. The only exception in our view is stateful, non-restartable workloads (ex: a video stream). For these workloads you'll want to periodically resize your resource allocation to match your resource utilization.
This ‘rightsizing pods’ workflow is a high-volume, manual task that forces engineers to think like SREs. There are many applications and dashboards on the market to help engineers make these decisions. We recommend using Flightcrew like a Dependabot to safely and autonomously update resource allocation without slowing your engineers.
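To make that concrete, here's a minimal sketch of what a rightsizing change looks like in a Deployment (all names and numbers are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: video-stream                  # hypothetical non-restartable workload
spec:
  replicas: 1
  selector:
    matchLabels: { app: video-stream }
  template:
    metadata:
      labels: { app: video-stream }
    spec:
      containers:
        - name: app
          image: registry.example.com/video-stream:1.2.3   # placeholder image
          resources:
            requests:
              cpu: 500m     # raised from 250m after sustained CPU throttling
              memory: 1Gi   # sized to observed peak usage plus headroom
            limits:
              memory: 2Gi   # hard ceiling to avoid destabilizing the node
```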
When should you use Vertical Scaling?
Vertical scaling means increasing the CPU and/or Memory of a workload to improve performance, and reducing these resources to manage costs.
In Kubernetes, the Vertical Pod Autoscaler (VPA) is a tool that dynamically changes a workload's requests and limits based on historical resource usage. This is obviously useful (fewer OOMs) and comes with obvious downsides (pod restarts, less predictability).
Traditionally, we recommend using VPAs where the downsides don't apply:
- If your workload is Stateful + Restartable ... VPA is a natural fit
- If the workload is Restartable and you're not too performance-sensitive ... try VPA!
- If you don't need to worry about redundancy or are forced to use a single worker (ex: a DaemonSet) ... VPA
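For illustration, here's a minimal VPA manifest for the natural-fit case, a restartable Deployment (names and bounds are hypothetical, not a definitive setup):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker                # the restartable workload to autoscale
  updatePolicy:
    updateMode: "Auto"                # VPA evicts pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"            # apply bounds to every container
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
```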
Vertical Pod Autoscaling is about to become more popular
Teams avoid VPA because it's not a good fit for their architecture (workloads that can't tolerate restarts) and because they don't want naive autoscaling logic. Restarts won't be an issue once Kubernetes allows VPA (and other tools) to update workload requests and limits in place, without recreating the pod. In-place resizing is currently in alpha, so we don't recommend it for production workloads.
In the meantime:
- You can run VPA in updateMode: Off to surface recommended requests and limits without applying them. Pipe these recommendations into your observability stack to keep your team accountable on availability and cloud costs.
- You can also run updateMode: Initial to apply VPA recommendations at pod creation - the VPA won't interfere with lifecycle management after that.
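As a sketch, the only difference between these modes is the updatePolicy stanza:

```yaml
# Recommendation-only: VPA computes targets but never evicts pods.
updatePolicy:
  updateMode: "Off"     # quote it - bare Off parses as a YAML boolean
# Or apply recommendations only when pods are first created:
# updatePolicy:
#   updateMode: "Initial"
```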
When should you use horizontal scaling?
Scaling horizontally means creating multiple instances of your pod to handle the workload. Autoscaling on Fargate and Lambda is horizontal autoscaling.
Kubernetes has two options for horizontal scaling: the Horizontal Pod Autoscaler (HPA) and Kubernetes Event-Driven Autoscaling (KEDA). These tools will dynamically create and delete instances based on metric and event triggers. They're incredibly powerful if you know what you're doing.
Horizontal autoscaling should be your generic solution for stateless services. Why?
- Redundancy is a universal good. You don't want a single instance for tier-1 workloads, you want multiple, properly provisioned instances.
- Horizontal autoscaling is probably more performant, if you take the time to tune scale-up and scale-down behavior (a minimal sketch follows).
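For reference, a minimal autoscaling/v2 HPA that scales a stateless Deployment on average CPU utilization might look like this (names and thresholds are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3                      # keep redundancy even at low traffic
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # add replicas above 70% average CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # damp flapping on scale-down
```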
How to pick between HPA and KEDA
We see a lot of teams ship HPA with ‘naive’ scaling on CPU and Memory. This is fine, but if you really want to become cost-efficient or handle gnarly traffic patterns, let's talk about KEDA.
KEDA extends HPA to scale on external metrics and events. This means you can do powerful things like:
- Scale on deterministic events like a launch or push notification
- Scale on recurring traffic patterns with cron (sketched below)
- Scale on incoming load (ex: queue depth) instead of lagging utilization metrics
- Declare custom, formula-based behavior with scaling modifiers
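As a sketch (the workload name and schedule are hypothetical), a KEDA ScaledObject that pre-scales for a known daily traffic pattern looks like this; queue- and event-based triggers follow the same shape but usually also need a TriggerAuthentication:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-scaler               # hypothetical name
spec:
  scaleTargetRef:
    name: checkout                    # Deployment to scale
  minReplicaCount: 2                  # keep redundancy overnight
  maxReplicaCount: 30
  triggers:
    - type: cron                      # pre-scale for a recurring pattern
      metadata:
        timezone: America/New_York
        start: "0 8 * * *"            # scale up at 08:00...
        end: "0 20 * * *"             # ...and back down at 20:00
        desiredReplicas: "10"
```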
If your workloads are dynamic and you believe standard metrics like Memory Utilization are the best representation of their underlying load, then maybe stick with HPA. KEDA is another component to maintain, and a lot of this functionality is hard to get right and still in beta.
And what's this I hear about diagonal scaling?
'Diagonal Scaling' is when a workload scales on two axes at once: vertically on one metric and horizontally on another (ex: vertically on CPU, horizontally on queue size). If you google the term you'll see a bunch of blogs, talks and SEO content outlining theoretical approaches.
Don’t do it.
If you can't scale successfully on one axis, it probably means you've got a hidden bottleneck from misconfiguration or application code. Try the simple things before you go inverted.
Who is responsible for cloud scaling?
It's inefficient (and unfair) to ask a busy engineer to make low-level decisions on resource management and scaling. These decisions require SRE experience, and every hour your engineers spend tuning configs and checking Grafana is an hour they're not spending on new features. We recommend that platform teams build a self-serve strategy to simplify and automate autoscaling for their end users. At first, strategy means education: an internal doc or wiki helps the brave navigate a few safe choices. At scale, top teams template IaC and automate logic so that autoscaling is abstracted away from feature teams.
Use Flightcrew to pick between HPA, VPA and KEDA
Flightcrew is a config copilot that helps engineers bulletproof and scale their cloud infrastructure. Flightcrew integrates into IDE and GitOps workflows to proactively catch and fix configuration issues.
You can use Flightcrew to:
- Generate a PR with the right autoscaler, metrics, behavior and thresholds for your workload
- Monitor your code and infrastructure for drift, and suggest optimization opportunities
- Simplify and explain autoscaling configs to your engineers.
Any questions? We'd love to compare notes on autoscaling and share more on what we're building - send us an email!
Tim Nichols
CEO/Founder
Tim was a Product Manager on Google Kubernetes Engine and led Machine Learning teams at Spotify before starting Flightcrew. He graduated from Stanford University and lives in New York City. Follow on Bluesky