
When Math Research Costs Less Than a Latte Run: What AlphaProof Nexus Signals About AI Agents

DeepMind's AlphaProof Nexus settled math problems that had stood open for 56 years for a few hundred dollars each, proof that closed-loop verification in agentic AI beats complex orchestration layers.

Zyfolks Team

Open math problems used to demand decades of human attention and university budgets to match. This week, Google DeepMind cracked nine of them, including two questions that had sat unanswered for 56 years, for roughly the price of dinner for two per problem. The kicker isn’t the math. It’s that the simplest version of the system, the one with almost no scaffolding, did most of the heavy lifting.

How AlphaProof Nexus Actually Cracks Erdős Problems

According to the research paper, AlphaProof Nexus autonomously solved 9 out of 353 open Erdős problems it attempted, settled 44 out of 492 open conjectures from the On-Line Encyclopedia of Integer Sequences (OEIS), resolved a 15-year-old Hilbert functions question in algebraic geometry, and improved a known bound in convex optimization. Inference costs ran just a few hundred dollars per problem. The trick: rather than asking Gemini 3.1 Pro to carry an entire logical chain in natural language, the system generates proof steps in Lean’s formal language and lets the compiler check each one, feeding error messages straight back into the next attempt.

This matters because it neutralizes the single biggest weakness of LLMs in technical work: confidently wrong reasoning. The compiler is a tireless, incorruptible critic. If the model hallucinates a step, Lean rejects it, and the loop tries again. Humans only audit the final output. For any team building agents that need to operate on high-stakes logic (finance, law, infrastructure), that ground-truth feedback loop is the architectural lesson worth copying. The take: closed-loop verification, not bigger models, is how agents start producing work that domain experts will actually trust.
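
The shape of that loop is easy to sketch. The Python below is a minimal illustration, not DeepMind's implementation: `closed_loop`, `toy_propose`, and `toy_check` are hypothetical names, the proposer stands in for an LLM call, and the checker stands in for the Lean compiler so the example runs without either.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    ok: bool
    error: str = ""


def closed_loop(
    propose: Callable[[str, List[str]], str],
    check: Callable[[str], CheckResult],
    goal: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if none passes."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = propose(goal, feedback)
        result = check(candidate)
        if result.ok:
            return candidate
        # The checker's error message becomes input to the next attempt.
        feedback.append(result.error)
    return None


# Toy stand-ins (assumptions for illustration only).
def toy_propose(goal: str, feedback: List[str]) -> str:
    # A real proposer would call a model; this one tries 1, 2, 3, ...
    return str(len(feedback) + 1)


def toy_check(candidate: str) -> CheckResult:
    # A real checker would run a compiler; this one verifies x * x == 9.
    x = int(candidate)
    if x * x == 9:
        return CheckResult(ok=True)
    return CheckResult(ok=False, error=f"{x} * {x} = {x * x}, expected 9")


answer = closed_loop(toy_propose, toy_check, goal="find x with x * x == 9")
print(answer)
```

The design point is that `check` is the only source of truth: the proposer can be wrong as often as it likes, and nothing leaves the loop unless the checker accepts it.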

Why the Simplest Agent Beating the Complex One Is the Real Story

The DeepMind team built four agent variants, each more elaborate than the last. Agent (A) is just an LLM in a loop with compiler feedback. Agent (B) adds queries to AlphaProof, Google’s reinforcement-learning system for olympiad math. Agent (C) introduces an evolutionary component inspired by AlphaEvolve, with rating agents built on Gemini 3.0 Flash scoring proof sketches using an Elo system. Agent (D) combines everything. Post-hoc analysis revealed that the bare-bones Agent (A) could prove all nine solved Erdős problems on its own, just at higher cost on the hardest cases.
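
For readers unfamiliar with Elo scoring, here is the textbook update rule in Python. It is a generic sketch: how the paper configures its rating agents is not described here, so the K-factor of 32, the starting rating of 1500, and the names `sketch_1` and `sketch_2` are all assumptions.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if A was judged better, 0.0 if B was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical proof sketches start level; a rater prefers the first.
ratings = {"sketch_1": 1500.0, "sketch_2": 1500.0}
ratings["sketch_1"], ratings["sketch_2"] = update_elo(
    ratings["sketch_1"], ratings["sketch_2"], score_a=1.0
)
print(ratings)
```

Pairwise ratings like this only need a judge that can say which of two candidates looks more promising, which is a much easier task than assigning an absolute score.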

The researchers describe this as “an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable.” That sentence should be pinned to every AI architect’s wall. If you’re a team weighing whether to build elaborate orchestration layers, that’s direct evidence the elaborate layer might be obsolete before you ship it. The decision between AI agents and traditional automation comes down to how much scaffolding you really need versus how much the base model already handles. The prediction: by the next model generation, half of today’s multi-agent frameworks will look like over-engineered relics, and the winners will be the teams who bet on tight feedback loops over baroque graphs.

Where Formal Verification Beats Raw Reasoning

The AlphaProof Nexus successes cluster in combinatorics, convex optimization, and number theory, areas where Lean’s Mathlib library is mature and problems decompose into clean sub-goals. The researchers acknowledge that most Erdős problems remain out of reach, “let alone problems that require extensive new theory.” But mathematicians working with the system reported that even failed attempts deepened their understanding, because formal sketches let experts focus on the unsolved sub-goals instead of re-checking entire arguments. The system also caught flawed formalizations in existing literature.
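
To make "formal sketch" concrete, here is a toy Lean 4 example. It is hypothetical and far simpler than anything in the paper, and it assumes open sub-goals are marked with Lean's built-in `sorry` placeholder, which the source does not confirm. The compiler checks every closed step and warns that the declaration still depends on `sorry`, so a reader knows exactly which step remains unproven.

```lean
-- Hypothetical toy sketch; not taken from the paper.
theorem toy_sketch (a b c : Nat) (h1 : a ≤ b) (h2 : b ≤ c) : a ≤ c + 1 := by
  -- Sub-goal 1: closed, and checked by the compiler.
  have hac : a ≤ c := Nat.le_trans h1 h2
  -- Sub-goal 2: left open. Lean accepts the file but warns about `sorry`.
  have hstep : c ≤ c + 1 := sorry
  -- The final step is verified relative to the two sub-goals above.
  exact Nat.le_trans hac hstep
```

An expert reviewing this file only has to think about the one line marked `sorry`; everything else has already been machine-checked.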

Developers should pay attention here. The economic value isn’t only in the final answer; it’s in the verifiable intermediate work product. Imagine you’re a fintech compliance team running an AI workflow that audits regulatory filings: even when the agent can’t produce a final ruling, a partial formal trace tells your humans exactly which clauses to scrutinize. Those are different unit economics from today’s “chatbot says yes or no” pattern. The take: the next wave of useful agents won’t be the ones that pretend to be done; they’ll be the ones that hand humans a clean partial result with the gaps marked.
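
One way to picture "a clean partial result with the gaps marked" is a trace where every step carries an explicit status. The Python below is an illustrative sketch only: the `Step` and `Status` types, the clause labels, and the report format are invented for this example and do not come from any real compliance tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    VERIFIED = "verified"
    UNRESOLVED = "unresolved"


@dataclass
class Step:
    label: str
    status: Status
    note: str = ""


def gaps(trace: List[Step]) -> List[Step]:
    """Return only the steps the agent could not verify."""
    return [s for s in trace if s.status is Status.UNRESOLVED]


def summarize(trace: List[Step]) -> str:
    """Build a short report: a verified count, then one line per gap."""
    open_steps = gaps(trace)
    verified = len(trace) - len(open_steps)
    lines = [f"{verified}/{len(trace)} steps verified"]
    lines += [f"REVIEW: {s.label} ({s.note})" for s in open_steps]
    return "\n".join(lines)


# Hypothetical audit trace for a single filing.
trace = [
    Step("clause 4.1 disclosure present", Status.VERIFIED),
    Step("clause 7.3 threshold calculation", Status.UNRESOLVED,
         "checker could not confirm"),
    Step("clause 9.2 signatures", Status.VERIFIED),
]
report = summarize(trace)
print(report)
```

The product decision is in `summarize`: the unresolved steps lead the human reviewer's to-do list instead of being hidden behind a single yes-or-no answer.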

What Erdős Benchmarks Tell Us About the Broader Race

The Erdős problem set has quietly become the de facto benchmark for AI mathematical reasoning. OpenAI recently used a proprietary reasoning model to disprove Erdős’s unit-distance conjecture, which Fields Medalist Tim Gowers called “a milestone in AI mathematics.” GPT-5.2 Pro helped solve Erdős problem #281, with Terence Tao describing it as “perhaps the most unambiguous instance” of an LLM solving an open math problem. GPT-5.4 cracked another one shortly after. Tao has also tempered the hype: AI’s overall Erdős success rate sits at one to two percent, concentrated on easier problems. DeepMind’s nine-of-353 result works out to about 2.5 percent, just above the top of that range.

The divergence in strategy is the real story. OpenAI’s approach tests raw LLM capability through natural language. DeepMind’s approach is engineered for reliable everyday research use, leaning on Lean as a safety net. They’re answering different questions. If you’re a buyer deciding between a custom AI build and an off-the-shelf SaaS option, the split tells you something: do you need the bleeding-edge ceiling, or the boring-but-grounded floor? Prediction: within a year, the OpenAI camp will quietly bolt on formal verification too, because once your competitor can prove their proofs, vibes-based reasoning stops being marketable.

FAQ

Q: What is AlphaProof Nexus? A: It’s a Google DeepMind framework that combines Gemini 3.1 Pro with the Lean formal proof assistant to attack open mathematics problems. The LLM proposes proof steps, Lean’s compiler verifies them, and error messages feed back into the next attempt, so the language model never has to carry the entire logical chain alone.

Q: How much did it cost to solve these math problems? A: According to the research paper, inference costs ran just a few hundred dollars per solved problem. That’s far less than human research hours or comparable specialized systems would cost, though the cost rises on the hardest problems where simpler agent variants need more attempts.

Q: Does this mean AI can now do original math research? A: Partially. The system cracked 9 of 353 attempted Erdős problems and 44 of 492 OEIS conjectures, mostly in areas where Lean’s Mathlib library is mature. The researchers themselves note that problems requiring extensive new theory remain out of reach, and Terence Tao has cautioned that overall AI success rates on Erdős problems sit at just one to two percent.

Key Takeaways

  • Teams designing agents should prioritize closed verification loops over multi-agent orchestration: DeepMind’s simplest agent variant solved the same nine Erdős problems as the most complex one, though at higher cost on the hardest cases.
  • Formal verification layers (Lean, type systems, schema validators, test harnesses) are the highest-leverage scaffold you can wrap around an LLM in technical domains.
  • Domains with mature symbolic infrastructure will see AI productivity gains first; fields without formal verification tools will lag, regardless of how good the underlying models get.
  • Partial agent output with clearly marked gaps is more valuable than confident-but-unverifiable full answers — design your products to surface what the agent couldn’t prove.
  • Expect OpenAI and other frontier labs to integrate formal verification into their reasoning stacks within the next model cycle, narrowing DeepMind’s current systemic advantage.
