Most AI agents stop learning the moment they ship. They get deployed, they hit production edge cases, and then a human engineer spends weeks reverse-engineering failures into prompt tweaks. The Tax AI system that OpenAI and Thrive Holdings built for Crete’s accounting network proposes a different bargain: every time a CPA corrects a draft return, that correction becomes structured evidence that Codex can investigate, fix, and validate against evals — without an engineer manually triaging the issue first. That changes how production agents improve, and it’s worth dissecting before the pattern becomes table stakes.
The Numbers Behind a Six-Month Pilot
According to the joint write-up from OpenAI and Thrive Holdings, Tax AI processed 7,000 tax returns across Crete’s network of 30+ accounting firms during this tax season. The system targets the data-entry slog of 1040 and 1041 returns, where the original report notes that medium- to high-complexity filings can consume eight hours per return in extraction alone. The reported outcomes: drafts at up to 97% accuracy, roughly a third of preparation time eliminated, and approximately 50% higher throughput.
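The time and throughput figures line up with each other. A minimal check, assuming throughput scales inversely with per-return preparation time while total hours stay fixed (an illustrative assumption, not something the write-up states; `throughput_gain` is a hypothetical helper):

```python
from fractions import Fraction


def throughput_gain(time_saved: Fraction) -> Fraction:
    """Relative throughput increase when each return takes
    (1 - time_saved) of its former time and total hours are unchanged."""
    if not 0 <= time_saved < 1:
        raise ValueError("time_saved must be in [0, 1)")
    return 1 / (1 - time_saved) - 1


gain = throughput_gain(Fraction(1, 3))
print(f"{float(gain):.0%}")  # 50%
```

Cutting a third of the time per return leaves two thirds, so the same hours cover 1.5 times as many returns, which matches the reported figure of roughly 50% higher throughput.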
The more interesting number is the trajectory. At launch, only a quarter of returns hit 75% correct field completion. Within six weeks, 86% of returns cleared that bar — and the 90% and 100% completion buckets grew even faster, per the team’s own measurements. This happened while the system was expanding into harder forms: K-1s, Schedule E rental properties, and reconciliation across multiple source files. The accuracy curve climbed even as the difficulty curve did.
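To make the bucket metric concrete, here is a small sketch of how "share of returns clearing a completion threshold" could be computed. The function name and the sample rates are hypothetical; the write-up does not describe the team's actual measurement code.

```python
def share_at_or_above(completion_rates: list[float], threshold: float) -> float:
    """Fraction of returns whose correct-field completion rate meets the threshold."""
    if not completion_rates:
        raise ValueError("no returns to measure")
    cleared = sum(1 for rate in completion_rates if rate >= threshold)
    return cleared / len(completion_rates)


# Hypothetical per-return completion rates (correct fields / total fields).
week_one = [0.40, 0.55, 0.80, 0.62]
week_six = [0.78, 0.92, 1.00, 0.70]

for label, rates in (("week one", week_one), ("week six", week_six)):
    buckets = {t: share_at_or_above(rates, t) for t in (0.75, 0.90, 1.00)}
    print(label, buckets)
```

Tracking the same thresholds week over week is what turns a single accuracy number into a trajectory.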
If you’re running an internal AI agent today and your accuracy chart is flat after launch, this is the benchmark you will be measured against. Holding accuracy steady after deployment is now the minimum expectation, not the goal.
My take: the headline isn’t “AI does taxes.” It’s that vertical agents now have a credible operating model for compounding gains week over week, and that changes the ROI math for anyone evaluating AI agents versus traditional automation.
Why the Three-Pillar Loop Actually Works
The OpenAI team frames the system around three pillars: practitioner feedback, production traces, and a Codex-driven iteration loop tied to bounded evals. Read past the framing and what they’ve actually built is an evidence pipeline. Source documents, extracted fields with citations, tax-engine mappings, and practitioner corrections are all preserved as one continuous trace from input to filed return.
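One way to picture that continuous trace is as a single record that links each extracted value to its citation, its tax-engine mapping, and any later correction. The sketch below is an assumption for illustration: every class, field name, and identifier is hypothetical and not taken from the Tax AI codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ExtractedField:
    name: str             # hypothetical field key, e.g. "fair_rental_days"
    value: str
    source_document: str  # citation: file the value was read from
    source_page: int
    engine_field: str     # tax-engine field the value was mapped to


@dataclass(frozen=True)
class Correction:
    field_name: str
    predicted: str
    corrected: str


@dataclass
class ReturnTrace:
    return_id: str
    source_documents: list[str]
    fields: list[ExtractedField] = field(default_factory=list)
    corrections: list[Correction] = field(default_factory=list)

    def evidence_for(self, correction: Correction) -> Optional[ExtractedField]:
        """Find the cited extraction that a practitioner correction refers to."""
        return next((f for f in self.fields if f.name == correction.field_name), None)


trace = ReturnTrace(
    return_id="example-001",
    source_documents=["rental_statement.pdf"],
    fields=[ExtractedField("fair_rental_days", "365", "rental_statement.pdf", 2, "SchE.FairRentalDays")],
    corrections=[Correction("fair_rental_days", "365", "350")],
)
evidence = trace.evidence_for(trace.corrections[0])
print(evidence.source_document, evidence.source_page)
```

Because the correction and the citation live in the same record, a reviewer (or a coding agent) can walk from the fix back to the exact page that produced the wrong value.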
The failure mode of most agent products is signal loss. A user fixes a wrong answer, ships their work, and the system never learns whether the model was wrong, the prompt was wrong, the tool call was wrong, or whether the user simply had a personal preference. The Crete pipeline categorizes each correction — extraction miss, mapping issue, unsupported behavior, tax judgment, or workflow noise — before anything reaches Codex. Repeated patterns get grouped into bounded evals with representative source packages and expected outputs.
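A rough sketch of the grouping step, assuming corrections have already been labeled with one of the five categories named above. The `min_count` threshold and the `(category, field)` grouping key are illustrative assumptions; the write-up does not say how patterns are actually detected.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    EXTRACTION_MISS = "extraction miss"
    MAPPING_ISSUE = "mapping issue"
    UNSUPPORTED_BEHAVIOR = "unsupported behavior"
    TAX_JUDGMENT = "tax judgment"
    WORKFLOW_NOISE = "workflow noise"


def repeated_patterns(corrections, min_count=3):
    """Return (category, field_name) pairs seen at least min_count times.

    corrections: iterable of (Category, field_name) tuples, already labeled.
    min_count is an illustrative threshold, not a figure from the write-up.
    """
    counts = Counter(corrections)
    return [key for key, n in counts.most_common() if n >= min_count]


# Hypothetical labeled corrections.
labeled = (
    [(Category.EXTRACTION_MISS, "fair_rental_days")] * 4
    + [(Category.TAX_JUDGMENT, "depreciation_method")]
    + [(Category.MAPPING_ISSUE, "k1_box_1")] * 3
)
print(repeated_patterns(labeled))
```

Each pair that clears the threshold is a candidate for a bounded eval; one-off corrections stay out of the loop until they recur.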
Picture a mid-sized accounting firm that already runs an off-the-shelf OCR vendor. The OCR misreads a fair-rental-days field, the senior accountant fixes it, and the fix dies in the file. In the Tax AI design, that same correction joins dozens of similar ones into a grouped finding, which then becomes a scoped task: “Tax AI consistently misses fair-rental-day fields on Schedule E. Here is the eval set, here is the trace, here is the schema you can modify.” That’s a hill Codex can climb autonomously.
My prediction: within twelve months, any serious vertical agent product will publish its eval pass rate the way SaaS companies publish uptime. “97% accuracy” without a structured eval harness behind it will start to sound like marketing.
What Codex Actually Does Inside the Loop
The write-up describes Codex’s role with unusual precision, which is rare for agent case studies. Codex isn’t given a vague “make this better” prompt. It receives a writable worktree containing the product surface it can modify, the targeted and regression evals that define success, and reusable skills documents that encode prior decisions. Read-only context provides the production trace, source documents, the Tax AI prediction, the finalized return, and tax-engine field documentation.
From there, Codex investigates the pipeline (extraction schemas, mapper behavior, source selection logic), implements targeted fixes, reruns the targeted eval plus a broader regression suite, and surfaces a candidate pull request. Ambiguous cases, where the evidence doesn’t clearly point to an automatable fix, get routed back to the engineering team instead of being forced through the loop. That last detail is the one I’d underline twice. Knowing when not to act autonomously is what separates production-safe agents from demo-grade ones.
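The validation step can be sketched as an explicit gate that decides between a candidate pull request and a handoff to engineers. This is a hypothetical illustration: the `gate` function, the `targeted_bar` threshold, and the "no regression drop" rule are assumptions, not the team's published criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    OPEN_PULL_REQUEST = "open candidate pull request"
    ROUTE_TO_ENGINEERS = "route to engineering team"


@dataclass(frozen=True)
class EvalResult:
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def gate(targeted: EvalResult, regression_before: EvalResult,
         regression_after: EvalResult, evidence_is_clear: bool,
         targeted_bar: float = 0.9) -> Decision:
    """Decide whether a candidate fix may proceed to human review as a PR."""
    if not evidence_is_clear:
        # Ambiguous evidence is never forced through the loop.
        return Decision.ROUTE_TO_ENGINEERS
    if targeted.pass_rate < targeted_bar:
        return Decision.ROUTE_TO_ENGINEERS
    if regression_after.pass_rate < regression_before.pass_rate:
        # The fix helped the target but broke something else.
        return Decision.ROUTE_TO_ENGINEERS
    return Decision.OPEN_PULL_REQUEST


print(gate(EvalResult(19, 20), EvalResult(200, 200), EvalResult(200, 200), True).value)
```

Note that even the passing branch only opens a pull request; merging stays a human decision, which matches the review gates the write-up describes.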
For teams scoping a first agentic build, the lesson is structural. Don’t ask Codex to fix your agent. Build the harness — traces, evals, a bounded worktree, explicit validation gates — and then let Codex operate inside it. The same logic drives most successful custom AI builds versus off-the-shelf SaaS deployments: the moat is the surrounding infrastructure, not the model call itself.
My take: “harness engineering” is going to become a job title before the end of 2026. The people who can design eval-backed loops are about to be more valuable than the people who can write clever prompts.
The Human Story That Validates the Architecture
Buried near the end of the OpenAI post is the detail that makes the business case land. One senior accountant who spent 180 hours on tax prep last year spent 15 hours on it this year, according to the report. She used the recovered time to personally walk every one of her clients through their returns and to take on new business.
That’s the actual product. The agent didn’t replace the accountant. It cleared the data-entry floor so the accountant could deliver the high-touch service her firm couldn’t profitably staff before. For anyone scoping a vertical agent — in fintech, lending, or banking workflows, in legal review, in clinical documentation — that’s the framing that wins internal approval. The agent absorbs the unloved hours so the human can move up the value stack.
My prediction: the firms that adopt self-improving agents in 2026 will not differentiate on cost. They’ll differentiate on the quality of human attention their professionals can finally afford to give clients, because the drudgery is gone.
The Reusability Question Nobody Is Asking Yet
The OpenAI and Thrive teams claim the same three-part design is now being applied to bookkeeping, audit, and IT help-desk automation across the Thrive portfolio. That claim is the one to watch. The rental-property workstream took roughly six weeks of substantial engineering oversight to reach 90% precision and recall, but the team reports that the abstractions, review artifacts, and eval conventions made subsequent schedules (C and A) faster to support.
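For readers less familiar with the metric, here is one common field-level definition of precision and recall. The write-up does not specify how the team computes them, so treat this as an assumption; the field names and values are hypothetical.

```python
def precision_recall(predicted: dict[str, str], expected: dict[str, str]) -> tuple[float, float]:
    """Field-level precision and recall.

    A predicted field counts as correct only if the finalized return
    contains the same field with the same value.
    """
    correct = sum(1 for name, value in predicted.items() if expected.get(name) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical Schedule E draft versus the finalized return.
predicted = {"fair_rental_days": "365", "rents_received": "24000",
             "personal_use_days": "0", "insurance": "1200"}
expected = {"fair_rental_days": "350", "rents_received": "24000",
            "personal_use_days": "0", "insurance": "1200",
            "mortgage_interest": "8100"}

print(precision_recall(predicted, expected))  # (0.75, 0.6)
```

Precision penalizes wrong values the agent filled in; recall penalizes fields it left out. Reporting both at 90% means the drafts are neither over-filling nor under-filling.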
If the loop genuinely transfers across domains, the implication is that the first vertical agent a company ships is expensive, and every subsequent one is cheaper. That’s a real moat, and it’s why holding-company structures like Thrive — with direct access to practitioners and production data inside operating businesses — may have a structural advantage over pure vendors. The vendor model requires negotiating access to feedback signals. The owner-operator model already has them.
My take: expect more AI-native rollups to appear. The play isn’t “sell software to accounting firms.” The play is “own the accounting firm, instrument its workflows, and let the agent compound.”
FAQ
Q: What is a self-improving AI agent? A: It’s an agent designed so that production usage automatically generates structured signals — corrections, traces, success criteria — that feed back into the engineering loop. Instead of an engineer manually diagnosing each failure, the system surfaces grouped patterns as bounded evaluation targets that a coding agent like Codex can act on with human review gates.
Q: Can Codex really write production fixes without human oversight? A: Per OpenAI’s account, Codex proposes candidate pull requests that go through engineering review, runs targeted plus regression evals before anything ships, and routes ambiguous cases back to humans rather than forcing a fix. So the loop is autonomous in investigation and proposal, not in deployment. Architecture and shipping decisions stay with engineers.
Q: Does this approach work outside of tax preparation? A: The Thrive Holdings team reports they’re applying the same three-pillar pattern to bookkeeping, audit, and IT help-desk automation. The general requirement is a domain where expert practitioners generate clear corrections, where production traces can be captured end-to-end, and where success can be defined in bounded evals. Most document-heavy, regulated workflows meet that requirement.
Key Takeaways
- Teams shipping AI agents in 2026 should plan for production trace capture from day one — corrections without context are wasted signal, and retrofitting trace infrastructure later costs more than building it upfront.
- The bottleneck for self-improving agents is not the model but the eval harness; structured eval pipelines are where competitive advantage in agent development is heading.
- Owner-operator models that combine engineering teams with practitioner access will likely outpace pure vendors on agent quality, because they don’t need permission to instrument the workflow.
- Expect “harness engineer” to emerge as a real role within the next year, distinct from ML engineer or prompt engineer.
- The most defensible vertical agents will be measured by their accuracy trajectory after launch, not their launch-day demo — flat post-deploy performance is about to look like a red flag.