What Superhuman's 200K QPS Inference Stack Tells Enterprises About Building Custom AI

Every enterprise AI conversation eventually hits the same wall: the proof-of-concept works beautifully, then production traffic shows up and the whole thing collapses under its own latency. Superhuman and Databricks just published the receipts on what it actually takes to run a custom large language model at 200,000 queries per second with sub-second P99 latency — and the story isn’t about a clever model. It’s about who owns which piece of the stack, and why most enterprises are about to rethink the “build vs. buy” question entirely.

According to the joint write-up from Databricks, Superhuman serves over 40 million daily users across dozens of languages, and its custom grammar correction model handles peak traffic exceeding 200,000 QPS with end-to-end latency under one second at P99 and four-nines reliability. Those numbers matter because they represent the kind of workload most enterprise buyers assume is reserved for hyperscalers. It isn’t. But getting there required a very specific arrangement between vendor and customer that most procurement teams aren’t structured to negotiate.

Why The Old DIY Serving Stack Stopped Working

Before the migration, Superhuman ran its own serving infrastructure built on vLLM with L40S GPUs, maintained by an internal ML infrastructure team. According to the original report, each new model iteration required months of manual performance tuning, and a lean team was spending its time on capacity planning and autoscaling instead of model quality.

Nearly every mid-to-large enterprise running custom AI hits the same trap. You start with a DIY stack because off-the-shelf inference services don’t meet your latency or cost requirements. Two years later, you’ve built a small platform team whose entire job is keeping that stack alive, and your data scientists are blocked behind infrastructure tickets. The opportunity cost is invisible until you measure it, and by then your competitors have shipped three model generations.

If you’re running a fintech platform with real-time fraud scoring, or a SaaS product where AI suggestions need to feel instantaneous, this is exactly where you’re headed, or already are. The work isn’t going away — someone has to handle quantization, kernel optimization, and autoscaler tuning — but it doesn’t have to be your team. Our take: within 18 months, “who owns the inference runtime” will be the most contentious line item in enterprise AI budgets, and most CTOs will choose to outsource it.

The Partnership Model That Actually Works For Custom Models

What made the Superhuman migration work wasn’t the move to a managed service itself. It was the division of ownership: Superhuman kept full ownership of model training, quantization, and quality standards, while Databricks took on runtime performance and platform reliability. Both teams agreed on shared SLOs upfront: sub-second P99 latency and zero quality regression on Superhuman’s internal evaluation harnesses.
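To make the two shared SLOs concrete, here is a minimal sketch of a release gate that checks both at once. It is an illustration only, not Superhuman’s or Databricks’ tooling: the function names, the nearest-rank percentile method, and the single-number quality score are all assumptions made for this example.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def meets_shared_slos(latencies_ms, baseline_score, candidate_score,
                      p99_budget_ms=1000.0):
    """Return True only if both halves of the agreement hold.

    baseline_score and candidate_score stand in for the output of an
    evaluation harness (hypothetical: higher is better).
    """
    latency_ok = p99(latencies_ms) < p99_budget_ms   # sub-second P99
    quality_ok = candidate_score >= baseline_score   # zero regression
    return latency_ok and quality_ok
```

The point of gating on both conditions together is that a runtime change that improves latency but lowers the evaluation score fails the release, and so does the reverse.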

This matters because the dominant industry framing — “build it yourself or use a managed API” — is a false choice for any company with a differentiated model. If you’ve spent two years fine-tuning a model on proprietary data, handing it to a generic inference API means losing the granular control that made the model valuable in the first place. But running the whole stack yourself means absorbing a permanent infrastructure tax. The middle path is a contractual partnership where the vendor commits engineering hours to your specific workload, not just compute credits.

Imagine you’re a healthcare company with a custom diagnostic model that has to clear regulatory evaluation harnesses every release. A traditional managed service won’t sign up for your quality bar. A DIY stack means hiring a five-person platform team. The Superhuman pattern — keep the model, outsource the runtime, share the SLOs — is what makes AI-integrated software solutions commercially viable for companies that aren’t Google. Expect this contract structure to become a standard procurement template by the end of 2026.

What 60 Percent More Throughput Per Pod Actually Costs

According to Databricks, the joint team increased per-pod throughput from 750 QPS to 1,200 QPS on H100 GPUs (a 60% improvement) with zero quality regressions. The single largest contributor was FP8 quantization, which on its own delivered up to a 30% increase in per-pod QPS.
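The arithmetic behind those figures can be checked in a few lines. The sketch below assumes pods scale linearly and ignores autoscaling headroom, both simplifications made for this example rather than claims from the report. The only inputs taken from the article are 750 QPS, 1,200 QPS, the 30% FP8 figure, and the 200,000 QPS peak.

```python
PEAK_QPS = 200_000          # peak traffic cited in the article
BEFORE_QPS_PER_POD = 750    # per-pod throughput before optimization
AFTER_QPS_PER_POD = 1_200   # per-pod throughput after optimization


def pods_needed(peak_qps, qps_per_pod):
    """Smallest whole number of pods that covers the peak (no headroom)."""
    return -(-peak_qps // qps_per_pod)  # integer ceiling division


improvement = AFTER_QPS_PER_POD / BEFORE_QPS_PER_POD - 1   # about 0.60
fp8_only_ceiling = BEFORE_QPS_PER_POD * 1.30               # "up to 30%" alone
pods_before = pods_needed(PEAK_QPS, BEFORE_QPS_PER_POD)    # 267
pods_after = pods_needed(PEAK_QPS, AFTER_QPS_PER_POD)      # 167

print(f"improvement: {improvement:.0%}")
print(f"pods at peak: {pods_before} -> {pods_after}")
```

Under those assumptions, the same peak needs roughly 100 fewer pods, which is where a per-pod percentage turns into a budget line.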

Getting there required tradeoffs most teams can’t staff for. Superhuman’s ML team prequantized the checkpoint using vLLM’s online quantization library. Databricks used per-channel scaling instead of off-the-shelf per-tensor scaling, preserving dynamic range. KV-cache quantization was deliberately left disabled because the quality tradeoffs weren’t worth it for this specific workload. A multiprocessing runtime server added another 20% throughput by breaking the single-process CPU bottleneck that small fast models create.
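Why per-channel scaling preserves dynamic range is easier to see with numbers. The sketch below is a simplified simulation of FP8 E4M3 rounding written for this article, not vLLM or Databricks code. The weight values are invented and deliberately exaggerated (a 100,000x gap between channels) so the effect is visible in a handful of numbers.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_e4m3(x):
    """Round x to the nearest value on a simplified E4M3 grid."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)      # saturate instead of overflowing
    _, exp = math.frexp(a)         # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)           # clamp at the smallest normal exponent
    step = 2.0 ** (e - 3)          # 3 mantissa bits: 8 steps per binade
    return math.copysign(round(a / step) * step, x)


def quantize_rows(weights, scales):
    """Quantize then dequantize each row using that row's scale."""
    return [[round_e4m3(w / s) * s for w in row]
            for row, s in zip(weights, scales)]


def per_tensor_scales(weights):
    """One scale for the whole matrix, set by its largest value."""
    peak = max(abs(w) for row in weights for w in row)
    return [peak / E4M3_MAX] * len(weights)


def per_channel_scales(weights):
    """One scale per output channel (row), set by that row's largest value."""
    return [max(abs(w) for w in row) / E4M3_MAX for row in weights]


def max_relative_error(original, restored):
    return max(abs(o - r) / abs(o) for o, r in zip(original, restored))


weights = [
    [4.0, -2.5, 1.25, 0.7],            # large-magnitude channel
    [4e-5, -2.5e-5, 1.25e-5, 7e-6],    # small-magnitude channel
]
small = weights[1]
tensor_err = max_relative_error(
    small, quantize_rows(weights, per_tensor_scales(weights))[1])
channel_err = max_relative_error(
    small, quantize_rows(weights, per_channel_scales(weights))[1])
```

With a single per-tensor scale, the large channel sets the scale and the small channel is squeezed into the bottom of the FP8 range, where its smallest weight rounds to zero. With one scale per channel, each row uses the full range and the error stays within a few percent.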

Buyers consistently underestimate this. The performance gains don’t come from buying better hardware. They come from dozens of micro-decisions about which layers to quantize, when to scale aggressively, and how to overlap CPU post-processing with GPU forward passes. If you’re a CTO evaluating whether to build your own inference platform, the honest question is: do you have the headcount to make 30 of those decisions correctly, and to redo them every time the model changes? For most teams, the answer is no, which is why the “AI agents vs. AI automation” decision increasingly hinges on who maintains the runtime, not which framework you pick.

How Infrastructure Choices Shape User Experience

The load balancing and container startup details look like backend trivia, but they’re actually the parts your users feel. According to the original report, Superhuman’s grammar correction endpoint shows strong diurnal traffic patterns with rapid ramps often exceeding 200K QPS. Default Kubernetes round-robin load balancing creates hotspots at high QPS that spike tail latency, so Databricks built a custom power-of-two-choices load balancer driven by an Endpoint Discovery Service that continuously monitors the Kubernetes API.
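The core of power-of-two-choices fits in a few lines: sample two backends and send the request to the less loaded one. The sketch below is a toy simulation, not the Databricks balancer. It has no Endpoint Discovery Service, no request completions, and no request durations, and it compares against a uniform random baseline because plain round-robin is trivially even in a model this simple. All function names are hypothetical.

```python
import random


def pick_random(loads, rng):
    """Baseline: choose one backend uniformly at random."""
    return rng.randrange(len(loads))


def pick_power_of_two(loads, rng):
    """Sample two distinct backends and pick the less loaded one."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b


def simulate(picker, backends=100, requests=100_000, seed=0):
    """Assign requests (never completing) and return (min load, max load)."""
    rng = random.Random(seed)
    loads = [0] * backends
    for _ in range(requests):
        loads[picker(loads, rng)] += 1
    return min(loads), max(loads)


lo_rand, hi_rand = simulate(pick_random)
lo_p2c, hi_p2c = simulate(pick_power_of_two)
print("random spread:", hi_rand - lo_rand)
print("power-of-two spread:", hi_p2c - lo_p2c)
```

The gap between the busiest and idlest backend is what shows up as tail latency. Looking at just one extra backend per request narrows that gap sharply without requiring a global view of every pod’s load.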

Container cold starts got similar treatment. Pulling a standard container image during a traffic ramp could take several minutes per pod, which would show up as user-visible latency spikes. The team adopted block-device-based image acceleration originally built for Databricks serverless compute, converting standard gzip images to a seekable block format with lazy loading. The result: pod start times dropped from minutes to seconds.
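The idea behind a seekable block format can be sketched with the standard library. A single gzip stream generally has to be decompressed from the start to reach a given offset, whereas independently compressed blocks can be fetched and decompressed one at a time. The code below is an illustration of that principle only, not the Databricks image format. The block size, class name, and in-memory list of blocks are assumptions; a real system would fetch blocks over the network.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size chosen for this sketch


def to_block_format(data, block_size=BLOCK_SIZE):
    """Compress each fixed-size block independently so it can be read alone."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


class LazyImage:
    """Decompress blocks only when a read actually touches them."""

    def __init__(self, blocks, block_size=BLOCK_SIZE):
        self.blocks = blocks
        self.block_size = block_size
        self.cache = {}  # block index -> decompressed bytes

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        last = min(last, len(self.blocks) - 1)
        chunks = []
        for i in range(first, last + 1):
            if i not in self.cache:           # load on first touch only
                self.cache[i] = zlib.decompress(self.blocks[i])
            chunks.append(self.cache[i])
        start = offset - first * self.block_size
        return b"".join(chunks)[start:start + length]
```

A container that reads only a small part of its image at startup touches only a few blocks, so it can begin serving before the rest of the image has arrived.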

If you’re running a B2B platform where users notice every 200ms of delay (and they do), these are the optimizations that separate a product that feels alive from one that feels like a Google Form. Most engineering teams will never build a power-of-two-choices load balancer from scratch, and they shouldn’t have to. Platforms are starting to ship these as defaults; the custom API and integration work around them is where differentiation actually lands. Prediction: by 2027, “sub-second P99 at variable QPS” will be a checkbox on enterprise SaaS RFPs, not a custom engineering project.

FAQ

Q: What is custom AI for enterprise, and how is it different from using an off-the-shelf API?
A: Custom AI for enterprise means owning the model (its training data, fine-tuning, quantization, and quality evaluation) rather than calling a generic foundation model API. The Superhuman case shows the modern pattern: the customer owns the model and quality bar, while a platform partner like Databricks owns the runtime, autoscaling, and SLA delivery. This split lets you keep proprietary advantages while offloading infrastructure work.

Q: How fast can a custom large language model realistically run in production?
A: According to the Databricks and Superhuman joint report, their custom model handles peak traffic above 200,000 QPS with end-to-end latency under one second at P99 and four-nines reliability. Per-pod throughput on H100 GPUs reached 1,200 QPS after optimization, up from 750 QPS, a 60% improvement driven largely by FP8 quantization and CPU bottleneck elimination.

Q: When does it make sense to migrate from a DIY inference stack to a managed platform?
A: When the operational burden (capacity planning, performance tuning, and autoscaling) starts consuming engineering time that should go to model quality. Superhuman’s trigger was that each new model iteration required months of manual tuning. If your platform team is bigger than your ML team, that’s usually the signal.

Key Takeaways

Enterprises with custom models should negotiate shared SLOs with inference platform vendors rather than accepting standard managed-service terms — the Superhuman contract structure will become a procurement template.
The 60% per-pod throughput gain came from quantization choices and CPU optimizations that most internal teams don’t have the headcount to evaluate; outsourcing the runtime is increasingly the rational choice.
Sub-second P99 latency at variable QPS depends on infrastructure details like power-of-two load balancing and lazy-loaded container images — these will become default expectations, not differentiators, within two years.
The “build vs. buy” framing for enterprise AI is obsolete; the real question is which layer of the stack you own and which layer you contract for.
If your ML engineers are spending more time on Kubernetes than on model evaluations, your competitive position is eroding faster than your metrics will show.
