An AI model just developed an exploit technique that two veteran browser security researchers had previously written off as too complex. That’s not a hypothetical from a threat-modeling deck — it happened inside ExploitBench, a new benchmark from Carnegie Mellon University that measures whether AI agents can autonomously attack real vulnerabilities in Google’s V8 JavaScript engine. The headline isn’t that Claude Mythos beat GPT-5.5. It’s that the gap between “toy capability” and “competent browser security researcher” closed quietly while most teams were still arguing about whether agents could refactor a React component.
What ExploitBench Actually Measures
Most prior “can an LLM hack” benchmarks check a single binary outcome: did the bug trigger or not. ExploitBench, built by researchers at Carnegie Mellon University, scores progress across five tiers, escalating all the way to T1 — arbitrary code execution on the target system. The target matters: V8 powers Chrome, Edge, Node.js, and Cloudflare Workers, which means a working V8 exploit is not an academic curiosity. It’s the entry point to a meaningful chunk of the consumer and serverless internet.
If you’re a security team running bug bounty triage, this is the benchmark you should be tracking instead of generic “capture the flag” leaderboards. The tiered scoring tells you not just whether a model can poke a bug, but whether it can chain primitives into something a real attacker would care about. That distinction — proof-of-concept versus weaponized — is exactly where defender effort has historically been spent. A benchmark that grades it explicitly is overdue.
Claude Mythos Versus GPT-5.5 — and Why the Numbers Look Lopsided
According to ExploitBench’s published results, Anthropic’s Claude Mythos Preview, allowed occasional human “nudges,” hit an average score of 9.90 out of 16 and reached the top tier on 21 of 41 vulnerabilities. OpenAI’s GPT-5.5 trailed at 5.51 points and reached T1 on just two. In fully autonomous mode the picture barely shifts for Mythos (9.55) while GPT-5.5 via Codex falls to 4.30. No other tested model achieved full code execution at all.
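To put those scores on a common scale, here is a minimal Python sketch that turns the figures quoted above into fractions of the maximum score and of the 41 vulnerabilities. The names (`results`, `score_fraction`, `t1_rate`) are illustrative and not part of ExploitBench. The sketch also assumes GPT-5.5's 5.51 was measured in the same setting as Mythos's 9.90, which the text implies but does not state.

```python
# Figures as quoted in this article from ExploitBench's published results.
MAX_SCORE = 16   # maximum average score
NUM_VULNS = 41   # vulnerabilities in the benchmark

# label -> (average score, T1 count, or None where the article gives no count)
# Assumption: the 5.51 figure belongs to the same setting as the 9.90 figure.
results = {
    "Claude Mythos Preview (with nudges)": (9.90, 21),
    "GPT-5.5 (headline comparison)": (5.51, 2),
    "Claude Mythos Preview (autonomous)": (9.55, None),
    "GPT-5.5 via Codex (autonomous)": (4.30, None),
}


def score_fraction(avg_score, max_score=MAX_SCORE):
    """Average score as a fraction of the maximum possible score."""
    return avg_score / max_score


def t1_rate(t1_count, num_vulns=NUM_VULNS):
    """Share of vulnerabilities on which the top tier (T1) was reached."""
    return t1_count / num_vulns


for label, (avg, t1) in results.items():
    line = f"{label}: {score_fraction(avg):.1%} of max score"
    if t1 is not None:
        line += f", T1 on {t1_rate(t1):.1%} of vulnerabilities"
    print(line)
```

On these numbers Mythos reaches T1 on roughly half of the vulnerabilities and GPT-5.5 on roughly one in twenty, which is a wider gap than the average scores alone suggest.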
The practical read is that for the first time, a publicly tested frontier model can act, end to end, like what ExploitBench co-author Seunghyun Lee — a researcher with over 20 reported browser vulnerabilities — described as a “fairly competent browser / JS engine security researcher.” Lee also noted that Mythos reproduced CVE-2024-0519, a bug that human researchers had failed to crack for over a year. If you’re a vendor shipping a JavaScript runtime, your threat model now has to assume an attacker with a tireless mid-level vuln researcher on tap. That changes patch cadence, fuzzing budgets, and how you scope internal red teams.
My take: the score gap is real, but the more interesting signal is that the shape of the work — hypothesis, primitive construction, iteration — is no longer beyond reach for a single model in a loop.
The Cost Story Nobody Wants to Talk About
Here’s the part the leaderboard partially obscures. ExploitBench reports that the full Mythos run cost roughly $36,428 across 122 episodes. GPT-5.5 via Codex ran 123 episodes for about $3,075, roughly one-twelfth the cost. The UK’s AI Security Institute reported a similar trade-off: better Mythos performance, but at much higher cost.
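As a quick check on that ratio, the following sketch divides the reported totals by the reported episode counts. The dictionary layout and function name are illustrative, and the per-episode figure is a simple average that assumes nothing about how cost varied between episodes.

```python
# Totals and episode counts as reported in this article.
RUNS = {
    "Claude Mythos Preview": {"total_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"total_usd": 3_075, "episodes": 123},
}


def cost_per_episode(run):
    """Average spend per episode for one benchmark run."""
    return run["total_usd"] / run["episodes"]


mythos_cost = cost_per_episode(RUNS["Claude Mythos Preview"])
gpt_cost = cost_per_episode(RUNS["GPT-5.5 via Codex"])

print(f"Mythos:  ${mythos_cost:,.2f} per episode")
print(f"GPT-5.5: ${gpt_cost:,.2f} per episode")
print(f"Per-episode cost ratio: {mythos_cost / gpt_cost:.1f}x")
```

That works out to just under $300 per Mythos episode against $25 per GPT-5.5 episode, so the roughly twelve-to-one figure holds per episode as well as in total.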
That ratio reframes the result. A defender or attacker choosing tools doesn’t pick the best model in the abstract; they pick the one that yields the most working exploits per dollar. At twelve-to-one, a well-funded adversary can run GPT-5.5 across a much wider surface and accept a lower per-bug success rate. The same economic logic governs the choice between an autonomous agent and a deterministic automation pipeline for any team budgeting AI work: capability per dollar usually beats raw capability. It’s also why OpenAI could plausibly close the gap by simply throwing more compute at the problem, a point the original report explicitly raises.
If you’re a CISO budgeting for AI-assisted red teaming in 2026, the spreadsheet question isn’t “which model is smartest.” It’s “which model finds the most exploitable bugs per $10,000 of API spend.” Right now the answer is genuinely unclear, and that ambiguity is itself a planning problem.
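One way to frame that spreadsheet question is as a break-even calculation: how often would the cheaper model need to succeed to match the pricier one per dollar? The sketch below is illustrative only. It assumes every episode costs the run's average and that episodes are independent, and the 50% success rate is a hypothetical placeholder, not a benchmark result.

```python
def expected_successes(budget_usd, cost_per_episode_usd, success_rate):
    """Expected working exploits for a budget, assuming equal-cost, independent episodes."""
    episodes = budget_usd / cost_per_episode_usd
    return episodes * success_rate


def break_even_rate(expensive_rate, expensive_cost, cheap_cost):
    """Per-episode success rate the cheaper model needs to match the pricier one per dollar."""
    return expensive_rate * cheap_cost / expensive_cost


# Average per-episode costs derived from the totals reported in this article.
MYTHOS_COST = 36_428 / 122
GPT_COST = 3_075 / 123

# HYPOTHETICAL success rate, chosen only to illustrate the arithmetic.
assumed_mythos_rate = 0.50

needed = break_even_rate(assumed_mythos_rate, MYTHOS_COST, GPT_COST)
budget = 10_000

print(f"Break-even GPT-5.5 success rate: {needed:.1%}")
print(f"Mythos per ${budget:,}: {expected_successes(budget, MYTHOS_COST, assumed_mythos_rate):.1f}")
print(f"GPT-5.5 per ${budget:,}: {expected_successes(budget, GPT_COST, needed):.1f}")
```

Whatever rate is plugged in, the cheaper model breaks even at about one-twelfth of the pricier model's per-episode success rate, which is why the per-dollar answer depends on success rates that the cost figures alone do not reveal.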
What This Means for Defenders Shipping Software
The ExploitBench authors flag one real limitation: the tested bugs are publicly known, so models could draw on training data. The dataset does include vulnerabilities with no public exploit or bug report, but the benchmark doesn’t yet test novel flaw discovery or live weaponization. That’s a meaningful asterisk. It’s also a temporary one.
For product teams, the practical move is to stop treating LLM-driven vuln discovery as a 2027 problem. If your stack ships anything embedding V8 — an Electron app, a Node.js backend, a Workers deployment — your fuzzing infrastructure, regression suite, and patch SLA all need to assume an attacker can run a benchmark-grade agent against your code. Companies building AI directly into their products face the same exposure: the agent capabilities making their features useful are the ones being pointed at their dependencies. My prediction: within the next 12 months, at least one publicly disclosed CVE will credit an autonomous AI agent as the discoverer, and the disclosure will trigger a wave of vendor policy changes about how AI-found bugs get handled, paid, and triaged.
FAQ
Q: What is ExploitBench? A: ExploitBench is a benchmark from Carnegie Mellon University researchers that measures how well AI agents can exploit real vulnerabilities in Google’s V8 JavaScript engine. It scores progress across five tiers up to full arbitrary code execution, rather than just checking if a bug triggers. The benchmark is available on GitHub and the accompanying paper is on arXiv.
Q: Why is the Claude Mythos cost so much higher than GPT-5.5? A: According to ExploitBench, the full Mythos run cost roughly $36,428 across 122 episodes, while GPT-5.5 via Codex ran 123 episodes for about $3,075, roughly one-twelfth the cost. The UK’s AI Security Institute reported a similar trade-off: better Mythos performance, but at much higher cost. The original report notes that OpenAI could close that gap with more compute.
Q: Can these AI models discover new vulnerabilities? A: Not yet, at least not as measured by this benchmark. The ExploitBench authors note that all tested bugs are publicly known, though some lack public exploits or bug reports. The benchmark doesn’t test novel vulnerability discovery or live weaponization.
Key Takeaways
- Security teams shipping anything built on V8 — Chrome extensions, Electron apps, Node.js services, Cloudflare Workers — should start modeling AI-assisted exploit development as a near-term threat, not a future one.
- The right metric for evaluating offensive AI tools is exploits-per-dollar, not raw leaderboard score; the twelve-to-one cost gap between GPT-5.5 and Claude Mythos could decide which model attackers and defenders actually deploy.
- Bug bounty programs and disclosure policies will need explicit language for AI-discovered vulnerabilities within the next year, and the first vendors to write that policy will set the template.
- Internal red teams should pilot agent-driven fuzzing against their own dependencies now, while costs are still high enough to constrain unsophisticated adversaries.
- Watch for OpenAI to respond with a compute-scaled variant of GPT-5.5 aimed specifically at narrowing the Mythos gap on benchmarks like ExploitBench — the economic incentive is too obvious to ignore.