GPU-Aware Autoscaling on Kubernetes: Why KEDA Needs an External Scaler for AI Workloads

Kubernetes autoscaling has a blind spot, and it’s the most expensive piece of silicon in your cluster. If you’re running vLLM, Triton, or training jobs on GPU nodes, the Horizontal Pod Autoscaler is, by default, still making decisions based on CPU and memory while a $30k accelerator sits half-idle or melts under load. One engineer just published a reference implementation that fixes this by sidestepping a structural limitation in KEDA itself, and the architectural pattern is worth studying even if you never touch their code.

Why KEDA Can’t Just Add a GPU Scaler

KEDA, the CNCF-graduated event-driven autoscaler, is compiled with CGO_ENABLED=0. NVIDIA’s Management Library (NVML), the canonical way to read GPU utilization, VRAM, temperature, and power draw, is a C library, so calling it from Go requires CGO. According to the author of the keda-GPU-scaler project, that single build flag is why you can’t drop a GPU scaler into KEDA core the way you can with Prometheus or Kafka. There’s a second structural problem too: the KEDA operator runs as a single deployment, but NVML calls are node-local. You cannot query GPU 0 on node-A from a pod scheduled on node-B.

This matters because it forces a specific architecture on anyone who wants real GPU-aware scaling. You cannot patch this with a sidecar or a clever annotation. You need a per-node agent, network-exposed metrics, and a protocol KEDA already speaks. If you’re a platform team trying to ship LLM inference on Kubernetes, this is the difference between an afternoon’s work and a multi-week design exercise — knowing the constraint up front saves you from architecting yourself into a corner.

Our take: the CGO restriction will outlive most attempts to relax it, because keeping KEDA’s core static-binary-friendly is genuinely valuable. External scalers are the official escape hatch, and projects that fight that grain will lose.

The DaemonSet Plus gRPC Pattern Is the Right Answer

The reference implementation runs a DaemonSet on every GPU node. Each pod calls NVML through the go-nvml bindings to read local GPU metrics, then serves them over gRPC using KEDA’s ExternalScaler interface. The KEDA operator connects to the scaler service and drives HPA decisions from there. The author points out this is the same shape Kubernetes uses for device plugins and the metrics server — a per-node agent that surfaces local hardware data through a well-known interface.

It’s a pattern, not a one-off hack. Any hardware signal that lives on a specific node — TPU stats, FPGA telemetry, custom ASIC counters, even NVMe wear leveling — fits the same mold. If you’re building custom AI agents or inference platforms that need to react to physical resource pressure, you now have a template that doesn’t require forking KEDA or maintaining a custom operator. Imagine a fleet running mixed H100 and A100 nodes serving different model sizes: the DaemonSet exposes per-node metrics, and a single ScaledObject per workload can pick the aggregation mode that fits.

Our take: expect to see this pattern copied for AMD Instinct and AWS Trainium within the next year. The CNCF ecosystem rewards composable extensions over monolithic features, and KEDA’s ExternalScaler is becoming the de facto integration point for non-CPU resources.

What You Can Actually Scale On — and Why It’s a GreenOps Story

The scaler exposes five metrics per GPU: gpu_utilization (SM compute), memory_utilization (memory controller), memory_used_percent (VRAM), temperature in degrees Celsius, and power_draw in watts. For multi-GPU nodes you choose max, min, avg, or sum, or target a specific GPU index. The author frames this explicitly as a GreenOps concern, not just a cost one: wasted GPU cycles convert directly into wasted energy and higher Scope 3 emissions.
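To make the aggregation modes concrete, here is a minimal Python sketch of collapsing per-GPU readings for one metric into a single value. It is not the project's Go code: the function name `aggregate`, its parameters, and the sample readings are hypothetical, chosen only to illustrate max, min, avg, sum, and the specific-index option described above.

```python
from statistics import mean


def aggregate(values, mode="max", gpu_index=None):
    """Collapse per-GPU readings for one metric into a single number.

    Hypothetical helper for illustration; names are not from the project.
    """
    if not values:
        raise ValueError("no GPU readings available")
    # Targeting one GPU bypasses aggregation entirely.
    if gpu_index is not None:
        return values[gpu_index]
    reducers = {"max": max, "min": min, "avg": mean, "sum": sum}
    if mode not in reducers:
        raise ValueError(f"unknown aggregation mode: {mode}")
    return reducers[mode](values)


# Made-up gpu_utilization readings for a four-GPU node.
util = [90.0, 40.0, 10.0, 60.0]
hottest = aggregate(util, "max")
average = aggregate(util, "avg")
second_gpu = aggregate(util, gpu_index=1)
```

The choice of mode changes behavior noticeably: with these sample readings, max reacts to the single busy GPU, while avg would see a node that looks only half loaded.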

Most platform teams justify autoscaling work on infra spend alone, but accelerator power draw is becoming a board-level metric. A vLLM pod sitting at 5% SM utilization is burning watts whether it’s serving tokens or not. Scaling on power_draw rather than on CPU or memory gives you a knob that ties directly to your sustainability dashboard. If you’re a team building multi-tenant SaaS platforms on GPU infrastructure, exposing per-tenant power consumption suddenly becomes possible, and that’s a billing dimension nobody is shipping yet.

Our take: by 2027, GPU power draw will be a first-class autoscaling signal in every serious inference platform, and teams who built around CPU-only HPA will be ripping out their scaling logic.

Pre-Built Profiles Lower the Activation Energy

The project ships profiles for common workload types so you don’t have to think about thresholds from scratch. The vllm-inference profile targets 80% memory_used_percent with a 5% activation threshold and supports scale-to-zero. triton-inference uses 75% gpu_utilization with 10% activation. training runs at 90% gpu_utilization with no scale-to-zero. batch is aggressive: 70% memory with 1% activation. A ScaledObject becomes a six-line YAML block pointing at the scaler service.
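One way to picture a profile is as a named bundle of metric, target, activation threshold, and scale-to-zero flag. The Python sketch below records only the values quoted above; the `Profile` class, the `PROFILES` table, and `resolve` are hypothetical names, and any field the article does not state is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    metric: str
    target: float                  # percent
    activation: Optional[float]    # percent; None = not stated in the article
    scale_to_zero: Optional[bool]  # None = not stated in the article


PROFILES = {
    "vllm-inference": Profile("memory_used_percent", 80.0, 5.0, True),
    "triton-inference": Profile("gpu_utilization", 75.0, 10.0, None),
    "training": Profile("gpu_utilization", 90.0, None, False),
    # The article says only "70% memory"; mapping that to
    # memory_used_percent is an assumption made for this sketch.
    "batch": Profile("memory_used_percent", 70.0, 1.0, None),
}


def resolve(name: str) -> Profile:
    """Look up a profile by name, failing loudly on typos."""
    try:
        return PROFILES[name]
    except KeyError:
        raise ValueError(f"unknown profile: {name}") from None


vllm = resolve("vllm-inference")
```

Keeping profiles as plain data like this is what makes a short ScaledObject plausible: the workload names a profile, and the thresholds come from the table.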

Profiles are what determine adoption. Most GPU teams know what they want — “scale my vLLM deployment when VRAM fills up” — but translating that into HPA math is tedious. Profiles encode the institutional knowledge. If you’re shipping an LLM endpoint behind a chatbot and your costs are creeping up (a real concern, and one the 2026 AI chatbot cost guide breaks down in detail), flipping on the vllm-inference profile with scale-to-zero is a one-day win.
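For readers who want to see the HPA math a profile hides, the sketch below applies the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentValue / targetValue), with its default 10% tolerance, plus a strictly-greater-than activation check for the zero-to-one step. The function names are hypothetical, and how the scaler's reported value maps onto a per-replica "current value" depends on the metric target type, which the article does not describe.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """HPA-style replica calculation for a single metric.

    If the ratio is within the tolerance band around 1.0, no scaling happens.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def is_active(value, activation_threshold):
    """Activation gate for scaling between zero and one replica."""
    return value > activation_threshold


# Two replicas averaging 95% VRAM against an 80% target: scale out.
scale_out = desired_replicas(2, 95.0, 80.0)
# Four replicas averaging 40% against an 80% target: scale in.
scale_in = desired_replicas(4, 40.0, 80.0)
# 84% against 80% is inside the tolerance band: hold steady.
hold = desired_replicas(3, 84.0, 80.0)
```

The activation threshold and the target do different jobs: the first decides whether the workload should exist at all, the second decides how many replicas it needs once it does.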

Our take: the projects that win in this space will ship opinionated defaults, not configuration matrices. Operators want a profile name, not a knob farm.

Why Testing Without GPUs Matters More Than the Feature List

The scaler includes a mock collector mode and an end-to-end test suite that spins up a real gRPC server with fake GPU data. The author reports 11 tests covering profiles, error paths, streaming, and aggregation modes, all running in CI without GPU hardware via go test -v -tags=e2e -race ./tests/e2e/.

That detail signals the project is serious. Anyone who has tried to CI-test GPU code knows the pain: either you pay for GPU runners, or you skip integration tests entirely. A mock collector that exercises the full IsActive → GetMetricSpec → GetMetrics flow means contributors can submit PRs without burning A100 minutes, and downstream forks stay testable. For platform teams evaluating whether to depend on this, the test surface is the real risk indicator.
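The idea of testing the scaler flow against fake GPU data can be sketched in a few lines of Python. This is an illustration of the approach only: the project itself is written in Go and serves real gRPC, while `MockCollector`, `GpuScaler`, the dictionary shapes, and the use of max aggregation here are all simplifying assumptions.

```python
class MockCollector:
    """Returns canned per-GPU readings, so no NVML or GPU is needed."""

    def __init__(self, readings):
        self.readings = readings  # {metric_name: [value per GPU]}

    def collect(self, metric):
        if metric not in self.readings:
            raise KeyError(f"metric not collected: {metric}")
        return list(self.readings[metric])


class GpuScaler:
    """Plain-Python stand-in for the IsActive, GetMetricSpec, GetMetrics calls."""

    def __init__(self, collector, metric, target, activation):
        self.collector = collector
        self.metric = metric
        self.target = target
        self.activation = activation

    def _value(self):
        # Assumption: aggregate across GPUs with max.
        return max(self.collector.collect(self.metric))

    def is_active(self):
        return self._value() > self.activation

    def get_metric_spec(self):
        return {"metric_name": self.metric, "target_size": self.target}

    def get_metrics(self):
        return {"metric_name": self.metric, "metric_value": self._value()}


# A busy node and an idle node, both faked.
busy = GpuScaler(MockCollector({"memory_used_percent": [12.0, 64.0]}),
                 "memory_used_percent", target=80.0, activation=5.0)
idle = GpuScaler(MockCollector({"memory_used_percent": [1.0, 2.0]}),
                 "memory_used_percent", target=80.0, activation=5.0)
```

Because the collector is injected, the same scaler logic can run against real hardware in production and against canned readings in CI, which is the property the paragraph above is praising.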

Our take: external scalers without mockable test paths will not get adopted in regulated environments. CI evidence is becoming a procurement requirement, not a nice-to-have.

FAQ

Q: What is KEDA and how does it differ from the standard Kubernetes HPA? A: KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF-graduated project that extends the Horizontal Pod Autoscaler with external event sources such as Kafka lag, Prometheus queries, and cloud queue depth. Out of the box, the standard HPA scales on CPU and memory; anything else requires wiring up a custom or external metrics adapter yourself. KEDA also supports scale-to-zero, which the HPA cannot do natively.

Q: Why can’t KEDA read GPU metrics directly? A: KEDA is built with CGO_ENABLED=0 to keep the binary static and portable, but calling NVIDIA’s NVML from Go requires CGO. Additionally, NVML calls are node-local, while the KEDA operator runs as a single deployment, so a per-node agent exposed over the network is the architecturally correct solution.

Q: Can I use this with non-NVIDIA GPUs? A: The reference implementation uses go-nvml, which is NVIDIA-specific. The same DaemonSet plus gRPC pattern would work for AMD ROCm or other accelerators by swapping the collector implementation, but no public ports exist at the time of writing.

Key Takeaways

Platform teams running GPU workloads should treat CPU-based HPA as a bug, not a baseline — power draw and VRAM are the signals that actually correlate with cost and SLOs.
The DaemonSet plus ExternalScaler pattern is reusable for any node-local hardware metric, so invest in understanding it before the next accelerator generation lands.
Scale-to-zero on inference workloads is now a one-YAML-file change for vLLM deployments, which collapses the cost gap between always-on and serverless inference.
Expect GPU power draw to become a board-reported sustainability metric within two years; teams without per-workload power telemetry will be scrambling to add it.
Any external scaler you adopt should ship with a mock collector and CI-runnable e2e tests, or you’re inheriting an untestable dependency.
