Use KEDA Scaling Modifiers to Manage AI Infrastructure
Sam Farid
CTO/Founder
TL;DR: KEDA’s new scaling modifiers unlock new tactics to dynamically and efficiently manage AI infrastructure in Kubernetes. By integrating intelligent scaling triggers—like validation drift or token generation latency—you can optimize resource usage, reduce costs, and achieve high performance.
What Are KEDA’s Scaling Modifiers?
KEDA extends Kubernetes’ native Horizontal Pod Autoscaling (HPA) to handle custom metrics and events. With standard HPAs, you often scale your deployment’s replicas based solely on CPU or memory utilization. KEDA, by contrast, enables event-driven scaling: for example, it can trigger additional replicas after receiving a high volume of user notifications.
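For readers newer to KEDA, here’s a minimal sketch of what that kind of event-driven ScaledObject might look like, assuming a hypothetical notification-worker Deployment that consumes a Kafka topic (the names, addresses, and thresholds below are illustrative, not prescriptive):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: notification-worker-scaler
spec:
  scaleTargetRef:
    name: notification-worker        # hypothetical Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.kafka.svc:9092
        consumerGroup: notification-workers
        topic: user-notifications
        lagThreshold: "50"           # roughly one replica per 50 unprocessed messages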
Scaling Modifiers went GA in KEDA 2.15, letting teams declare powerful, flexible scaling criteria through formulas and conditions. The feature is available out of the box with KEDA’s standard Helm chart install, and after experimenting with it since the release, we’re convinced it’s game-changing for managing modern, AI-driven workloads.
Why Do Scaling Modifiers Matter for AI Infrastructure?
AI and machine learning workloads often involve complex resource demands, including large GPU clusters, dynamic model retraining schedules, and unpredictable token-level throughput. Traditional autoscaling triggers—like CPU or memory—don’t always correlate with meaningful performance indicators for AI.
Scaling modifiers allow you to:
- Align scaling decisions with business-critical metrics: For example, triggering scaling based directly on model validation drift ensures output quality doesn’t fluctuate.
- Improve cost-efficiency: Only add more replicas or allocate GPUs when genuinely needed, rather than on a fixed schedule.
- Enhance reliability and resilience: By reacting to real-time signals directly related to user experience such as token latency, you can maintain stable response times despite traffic changes.
Below are some applications to show off the Power and Glory of scaling modifiers:
1. Dynamic Model Retraining on Validation Failure and GPU Availability
Challenge: Traditional model retraining typically runs on a predefined schedule (e.g., a cronjob), but this has downsides: if retraining isn’t yet necessary, it runs too early and wastes precious resources. On the other hand, if the model’s validation scores drift quickly between scheduled runs, retraining kicks off too late and you’re already serving bad results.
Solution: Use scaling modifiers to trigger retraining only when validation scores drift beyond a certain threshold, when GPUs are free, and when queue length is manageable to avoid interfering with production service.
Configuration Example:
advanced:
  scalingModifiers:
    formula: "validation_drift > 0.1 && available_gpus >= 2 && request_queue_length < 1000 ? 1 : 0"
    activationTarget: "1"
With an activation target, the retraining workload scales from 0 to 1 whenever every condition in the formula is true, rather than depending on time-based triggers.
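For context, here’s a fuller sketch of the surrounding ScaledObject, assuming the three signals are exposed through Prometheus and the retraining workload is a scale-to-zero Deployment (the trigger names match the variables in the formula; queries, endpoints, and thresholds are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-retraining-scaler
spec:
  scaleTargetRef:
    name: model-retraining            # hypothetical retraining Deployment
  minReplicaCount: 0                  # stays at zero until the formula activates
  maxReplicaCount: 1
  advanced:
    scalingModifiers:
      formula: "validation_drift > 0.1 && available_gpus >= 2 && request_queue_length < 1000 ? 1 : 0"
      target: "1"                     # composite target for the formula's 0/1 output
      activationTarget: "1"
  triggers:
    - type: prometheus
      name: validation_drift          # referenced by name in the formula
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: model_validation_drift             # hypothetical gauge from your eval pipeline
        threshold: "0.1"
    - type: prometheus
      name: available_gpus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(cluster_free_gpus)              # hypothetical gauge from your GPU exporter
        threshold: "2"
    - type: prometheus
      name: request_queue_length
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(inference_request_queue_length) # hypothetical gauge from the serving layer
        threshold: "1000"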
What This Achieves:
- Resource Optimization: Prevents unnecessary retraining.
- Timely Model Updates: Initiates retraining as soon as model quality degrades.
- Production Safety: Ensures no conflict with ongoing serving workloads.
2. Scale Model Infrastructure on Tokens, Not Just Traffic
Challenge: End-to-end latency is an imperfect metric for model load because outputs vary in length. A better indicator is token-level latency, which directly correlates with model workload and performance.
Solution: Configure scaling modifiers to focus on token throughput per pod. When token generation slows down, it means the model is reaching its performance limit, and you can scale up replicas to maintain responsiveness.
Configuration Example:
advanced:
  scalingModifiers:
    formula: "1 / (tokens_per_min / pod_count)"
    target: "0.001"
The per-pod token generation rate drops when model replicas become overloaded, so this formula watches for the inverse of that rate to rise, adding replicas as token latency grows.
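Here’s a minimal sketch of how the two trigger values might be wired up, assuming a Prometheus counter of generated tokens and kube-state-metrics for the ready-replica count (all names and queries are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-scaler
spec:
  scaleTargetRef:
    name: model-server                 # hypothetical inference Deployment
  minReplicaCount: 2
  maxReplicaCount: 30
  advanced:
    scalingModifiers:
      formula: "1 / (tokens_per_min / pod_count)"
      target: "0.001"                  # metric hits this target when per-pod throughput falls to ~1000 tokens/min
  triggers:
    - type: prometheus
      name: tokens_per_min
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(generated_tokens_total[1m])) * 60    # hypothetical counter from the model server
        threshold: "1000"
    - type: prometheus
      name: pod_count
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: kube_deployment_status_replicas_ready{deployment="model-server"}
        threshold: "1"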
What This Achieves:
- Accurate Load Measurement: Scale based on how quickly tokens are generated, not just raw traffic.
- Better User Experience: Keeps response times stable even as prompt lengths grow more complex.
- Cost Control: Prevent unnecessary over-provisioning, only scaling when token latency proves it’s needed.
3. Model Rebalancing for Changing Traffic Patterns
Challenge: During traffic spikes, using a single large model can balloon costs and latency. Sometimes, it’s better to dynamically switch to a more cost-effective (even if slightly less capable) model to handle increased load.
Solution: Use scaling modifiers to trigger additional replicas of a cheaper model once traffic (measured by tokens per minute) crosses a certain threshold. This helps you rebalance workloads, maintain throughput, and manage costs during peak times.
Configuration Example:
advanced:
  scalingModifiers:
    formula: "tokens_per_min >= 5000 ? request_rate : 0"
    activationTarget: "1"
    target: "100"
This formula shows how to trigger scaling for a workload serving a cheaper model only after observing high throughput, allowing for conditional load balancing between models.
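A sketch of how this could hang together, assuming the ScaledObject targets a Deployment of the cheaper model and both signals come from Prometheus (names, queries, and thresholds are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fallback-model-scaler
spec:
  scaleTargetRef:
    name: small-model-server            # hypothetical Deployment of the cheaper model
  minReplicaCount: 0                    # stays off until the primary model is saturated
  maxReplicaCount: 10
  advanced:
    scalingModifiers:
      formula: "tokens_per_min >= 5000 ? request_rate : 0"
      activationTarget: "1"
      target: "100"                     # roughly 100 requests/min per replica (illustrative)
  triggers:
    - type: prometheus
      name: tokens_per_min
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(generated_tokens_total[1m])) * 60        # hypothetical token counter
        threshold: "5000"
    - type: prometheus
      name: request_rate
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{service="model-gateway"}[1m])) * 60   # hypothetical request counter
        threshold: "100"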
What This Achieves:
- Elastic Strategy: Rapidly introduce cheaper models when load spikes.
- Cost Efficiency: Prevent runaway costs during peak periods.
- Sustainable Performance: Maintain user satisfaction without over-provisioning.
Additional Tips for Using KEDA Scaling Modifiers
- Combine Metrics for Granular Control: You don’t have to limit scaling to a single metric. Consider combining validation drift, token latency, and GPU availability into a single formula for a more holistic autoscaling policy (see the sketch after this list).
- Test in a Staging Environment: Before rolling out changes to production, run experiments with scaling modifiers in a controlled environment. Fine-tune your formulas and targets to avoid oscillation or overshoot. It’s a powerful technology but it’s still rather new.
- Monitor Logs and Metrics Over Time: Use dashboards and logging tools to keep an eye on how your scaling configuration affects performance, cost, and reliability. Continuous monitoring helps you catch scaling behavior that isn’t acting as expected, or configurations that need updating as your initial assumptions drift with evolving applications and traffic.
- Integrate with CI/CD Pipelines: Treat scaling formulas as code. Version control them, and use CI/CD pipelines to review changes. This keeps your autoscaling strategies as agile and well-managed as your application code.
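As a rough illustration of the first tip, a single formula can combine several named triggers. Assuming triggers called token_latency_ms, validation_drift, and available_gpus are defined on the ScaledObject, the weights and thresholds here are purely illustrative:

advanced:
  scalingModifiers:
    # Drive scaling by token latency, add extra pressure when validation drift
    # is high, and hold off entirely while no free GPUs exist to absorb new replicas.
    formula: "available_gpus >= 1 ? token_latency_ms + (validation_drift > 0.1 ? 50 : 0) : 0"
    target: "200"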
The Future of AI Infrastructure with KEDA
Scaling Modifiers are new, but we anticipate they’ll rapidly become a core best practice for managing large-scale, cloud-native AI workloads. Their ability to tie scaling decisions directly to meaningful business and performance metrics makes them invaluable in today’s complex, data-driven environments.
We’d like to congratulate the KEDA core team on this release and recommend checking out their talk from KubeCon Europe.
And if you’re interested in exploring how Flightcrew can help you streamline KEDA’s scaling modifiers, or if you just want to chat about your use cases, drop us a line at hello@flightcrew.io.
Sam Farid
CTO/Founder
Before founding Flightcrew, Sam was a tech lead at Google, ensuring the integrity of YouTube viewcount and then advancing network throughput and isolation at Google Cloud Serverless. A Dartmouth College graduate, he began his career at Index (acquired by Stripe), where he wrote foundational infrastructure code that still powers Stripe servers.