Frontier Models Just Failed Their First Real SRE Exam — And That's the Most Honest Benchmark We've Had in Months

Every vendor pitch deck this year promises an AI agent that can run your infrastructure. Then Artificial Analysis and IBM Software Innovation Lab dropped ITBench-AA, pointed frontier models at real Kubernetes incidents, and watched the entire industry score below 50%. Claude Opus 4.7 topped the chart at 47%. GPT-5.5 came in at 46%. Qwen3.7 Max at 42%. This isn’t a benchmark you brag about — it’s the one that finally tells the truth about agentic IT work.

Why a Sub-50% Score Is the Whole Point

ITBench-AA is the first in a planned series of benchmarks for agentic enterprise IT, starting with Site Reliability Engineering and expanding into FinOps and CISO tasks over time. According to Artificial Analysis, every frontier model evaluated scored below 50%, making it one of the least saturated agentic benchmarks in their suite — a sharp contrast to Terminal-Bench, where the same models score considerably higher.

That gap matters because most agent benchmarks have been quietly maxing out. When a benchmark saturates, you can’t distinguish Claude from Qwen from Gemini, and buyers are left reading marketing copy. A benchmark where the leader sits at 47%, with real headroom above it, gives engineering teams something they can actually use to pick a model, and gives model labs something concrete to climb.

If you’re an SRE lead evaluating whether to hand part of your incident response to an agent, this is the first dataset that will tell you, with attribution, whether the model can read logs, walk a Kubernetes topology, and name the right root-cause entity. The answer right now is: less than half the time.

My take: expect the 2026 round of frontier model releases to start citing ITBench-AA the way 2024 releases cited SWE-bench. Saturated benchmarks get replaced; honest ones get climbed.

What the Tasks Actually Look Like

ITBench-AA SRE ships 59 tasks in total (40 public and 19 held-out), each one a frozen Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model runs inside Artificial Analysis’s open-source Stirrup harness with shell access to a sandboxed filesystem. Each task is capped at 100 turns and repeated three times. The model’s job is to submit the minimal set of independent root-cause Kubernetes entities, scored by average precision at full recall: miss any ground-truth cause and you get 0.0 for that repeat.

The scoring is the cleverest design choice. It punishes the exact failure mode that makes AI agents annoying in production — over-investigating and dumping a wall of “possibly related” entities on the on-call engineer. The harness stays constant across all models, so you’re comparing reasoning quality, not tool plumbing.

Imagine you’re running an AI-integrated platform that ships an incident-response copilot. Under ITBench-AA’s rules, your agent doesn’t win points for finding the chaos-mesh controller that injected the fault. It loses points, because that controller is a false positive when the real root cause is an entity such as otel-demo/NetworkPolicy/frontend-block-all-ports. Precision is a feature, not an afterthought.
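To make the scoring rule concrete, here is a minimal sketch of one plausible reading of it. This is an assumption, not the published scorer: it treats a submission as an unordered set, returns 0.0 unless every ground-truth entity is present, and otherwise returns plain set precision. The real metric may be rank-aware. The function names and the chaos-mesh entity name are illustrative.

```python
def score_repeat(submitted, truth):
    """Assumed reading of 'precision at full recall' for a single repeat."""
    submitted, truth = set(submitted), set(truth)
    # Recall gate: missing any ground-truth root cause zeroes the repeat.
    if not truth or not truth.issubset(submitted):
        return 0.0
    # Every extra entity is a false positive that dilutes precision.
    return len(truth) / len(submitted)


def score_task(repeats, truth):
    """Average the per-repeat scores (the article says each task runs three times)."""
    return sum(score_repeat(r, truth) for r in repeats) / len(repeats)


ROOT = "otel-demo/NetworkPolicy/frontend-block-all-ports"
INJECTOR = "chaos-mesh/Deployment/chaos-controller-manager"  # illustrative name

clean = score_repeat({ROOT}, {ROOT})            # only the root cause
noisy = score_repeat({ROOT, INJECTOR}, {ROOT})  # root cause plus the fault injector
missed = score_repeat({INJECTOR}, {ROOT})       # fault injector only
task = score_task([{ROOT}, {ROOT, INJECTOR}, {INJECTOR}], {ROOT})

print(clean, noisy, missed, task)
```

Under this reading, naming the fault injector alongside the real root cause halves the score for that repeat, and naming it instead of the root cause zeroes it.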

My take: the recall-gated precision metric will quietly become the standard for any agent benchmark that touches operations. Loose grading was always going to break the moment agents started being measured against on-call engineers.

The Counterintuitive Lesson: More Thinking, Worse Answers

The turn-count data buried in the results is the headline every model team should be staring at. Per Artificial Analysis, turn counts vary nearly 3x across models, and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%. Gemini 3.1 Pro Preview averages 83 turns and scores 30%. Gemma 4 31B (Reasoning), an open-weights model, averages 58 turns and scores 37% — beating Gemini 3.1 Pro Preview while using fewer turns.

That disconnect between turn count and accuracy is the most actionable signal in the report. The conventional wisdom of the last 18 months has been “more reasoning steps = better outcomes”, which has driven up token costs and pushed teams toward expensive max-effort configurations. ITBench-AA shows the opposite for diagnostic work: models that over-investigate surface upstream fault-injection mechanisms or co-occurring symptoms as false positives, then get penalized for them.
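A quick way to sanity-check the turn-count claim is to tabulate the three data points quoted above. The sketch below uses only those figures; three models are far too few for a statistical conclusion, so read it as arithmetic rather than analysis.

```python
# (average turns per task, score in %) as quoted in this article
results = {
    "GPT-5.5 (xhigh)": (31, 46),
    "Gemma 4 31B (Reasoning)": (58, 37),
    "Gemini 3.1 Pro Preview": (83, 30),
}

turns = [t for t, _ in results.values()]
turn_spread = max(turns) / min(turns)

# Rank the models two ways: fewest turns first, and highest score first.
by_turns = sorted(results, key=lambda m: results[m][0])
by_score = sorted(results, key=lambda m: results[m][1], reverse=True)

print(f"turn spread: {turn_spread:.2f}x")
print("fewest turns first: ", by_turns)
print("highest score first:", by_score)
```

For these three models the two rankings are identical: the model that used the fewest turns scored highest, and the one that used the most scored lowest.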

If you’re building an autonomous diagnostic agent, the takeaway is uncomfortable: your prompt scaffolding probably needs to reward stopping, not exploring. A model that confidently submits three entities and quits will outscore a model that submits twelve after extensive deliberation, even when both found the right answer.

My take: within six months, expect agent frameworks to ship explicit “submission discipline” controls — budgets that force the model to commit and stop, rather than keep poking around.
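What might a “submission discipline” control look like? The sketch below is purely hypothetical and not taken from Stirrup or any existing framework. It assumes the agent exposes a `step` callable that returns its current ranked candidate list plus a done flag, and it forces a commit when a turn budget runs out, trimming the submission to a fixed number of entities. All names and default values are made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical agent interface: step(turn) -> (ranked candidates so far, done flag)
Step = Callable[[int], Tuple[List[str], bool]]


def run_with_budget(step: Step, max_turns: int = 30, max_entities: int = 3):
    """Run the agent until it says it is done or the turn budget is spent,
    then submit only its top-ranked candidates."""
    ranked: List[str] = []
    turns_used = 0
    for turn in range(1, max_turns + 1):
        ranked, done = step(turn)
        turns_used = turn
        if done:
            break
    # Forced commit: keep the top candidates, drop the "possibly related" tail.
    return ranked[:max_entities], turns_used


def wandering_agent(turn: int):
    """Toy agent that never stops and adds one more suspect every turn."""
    return [f"demo/Pod/suspect-{i}" for i in range(turn)], False


submission, turns_used = run_with_budget(wandering_agent, max_turns=5, max_entities=2)
print(submission, turns_used)
```

The toy agent would keep investigating forever. The wrapper stops it after five turns and submits only its two highest-ranked suspects.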

Open Weights Are Sitting on the Cost Frontier

The pricing data flips the usual narrative about open-weights models being a budget compromise. Gemma 4 31B (Reasoning) scores 37% at $0.14 per task, outperforming Gemini 3.1 Pro Preview ($2.23 per task, 30%) on both score and cost. GLM-5.1 (Reasoning) scores 40% at $1.23 per task, matching Gemini 3.5 Flash (high) at $1.70 on score while costing less. Claude Opus 4.7 leads at 47% but is the most expensive at $5.38 per task.
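The “cost frontier” claim can be checked directly from the five price and score pairs quoted above. The sketch below marks a model as dominated if another model is at least as cheap and at least as accurate, and strictly better on one of the two. It covers only the models whose per-task cost appears in this article, so it is not the full leaderboard.

```python
# (cost per task in USD, score in %) as quoted in this article
models = {
    "Gemma 4 31B (Reasoning)": (0.14, 37),
    "GLM-5.1 (Reasoning)": (1.23, 40),
    "Gemini 3.5 Flash (high)": (1.70, 40),
    "Gemini 3.1 Pro Preview": (2.23, 30),
    "Claude Opus 4.7": (5.38, 47),
}


def dominated(name):
    """True if some other model is no worse on cost and score, and better on one."""
    cost, score = models[name]
    return any(
        c <= cost and s >= score and (c < cost or s > score)
        for other, (c, s) in models.items()
        if other != name
    )


# Models that survive, ordered from cheapest to most expensive.
frontier = sorted((m for m in models if not dominated(m)), key=lambda m: models[m][0])

# Cost multiple between the leader and the cheapest model.
premium = models["Claude Opus 4.7"][0] / models["Gemma 4 31B (Reasoning)"][0]

print("frontier:", frontier)
print(f"Claude vs Gemma cost multiple: {premium:.1f}x")
```

Among these five, both Gemini entries drop out: Gemini 3.1 Pro Preview is dominated by Gemma 4 31B, and Gemini 3.5 Flash (high) by GLM-5.1.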

For anyone running incident response at scale — think a managed services provider handling thousands of Kubernetes alerts a day — that $0.14 vs $5.38 spread is the difference between an AI cost line that’s a rounding error and one that needs a board slide. And the cheaper option isn’t last on the leaderboard; it’s beating a frontier closed model.

If your team has been deferring an open vs proprietary AI decision, ITBench-AA gives you a defensible answer for diagnostic workloads: open weights are competitive on quality and dominant on cost-per-task. The quality lead of closed frontier models is real but narrow: ten percentage points separate Claude Opus 4.7 (47%) from Gemma 4 31B (Reasoning) (37%).

My take: the first vendor to ship a production SRE agent built on Gemma 4 or GLM-5.1, tuned specifically for the recall-gated precision pattern, will undercut the Claude/GPT-based competitors on price by an order of magnitude — and the quality gap will be small enough that buyers won’t care.

FAQ

Q: What is ITBench-AA?
A: ITBench-AA is a new benchmark from Artificial Analysis and IBM Software Innovation Lab that evaluates AI models on agentic enterprise IT tasks. It launches with 59 Site Reliability Engineering tasks based on Kubernetes incident snapshots, and will expand to FinOps and CISO tasks over time.

Q: Why are all frontier models scoring below 50%?
A: The tasks require agents to identify the exact minimal set of root-cause Kubernetes entities for an incident, using shell access to logs, traces, metrics, and topology. Scoring uses average precision at full recall, which means missing any true root cause scores 0.0 and adding extra “contributing” entities counts as false positives. That punishes the over-investigation behavior most models default to.

Q: Does spending more on a model mean better results?
A: Not on this benchmark. Per the published results, Gemma 4 31B (Reasoning) scores 37% at $0.14 per task while Gemini 3.1 Pro Preview costs $2.23 per task and scores 30%. Claude Opus 4.7 leads at 47% but at $5.38 per task, a roughly 38x premium over Gemma for ten percentage points.

Key Takeaways

If you’re evaluating models for incident-response automation, run them through ITBench-AA before trusting any vendor’s marketing; it’s currently the most honest benchmark out there.
Tune your agent harness to reward stopping early; the data shows over-investigation actively destroys precision under recall-gated scoring.
Open-weights models like Gemma 4 31B and GLM-5.1 deserve a slot in your evaluation shortlist for diagnostic workloads, not just a footnote.
Watch for FinOps and CISO benchmarks from the same partnership next; they’ll likely expose similar honesty gaps in other operational domains.
Expect the next generation of agent frameworks to ship explicit submission-discipline controls, because turn-count optimization is now a measurable cost driver.
