Building self-improving tax agents with Codex

Over six months, a collaboration between OpenAI and Thrive Holdings turned Codex into a self-improving agent that processed 7,000 tax returns for Crete. Within six weeks, the share of returns reaching 75% field completion rose from 25% to 86%.

The feedback loop shifted from a manual engineering chore to an automated improvement engine.

The metrics of self-improvement

Crete's network of accounting firms prepares tens of thousands of returns annually. For medium- to large-complexity filings, data entry consumes eight hours per return, driven by messy source documents, prior-year records, and manual extraction.

Tax AI automates the preparation of 1040 and 1041 returns. The system saves practitioners approximately one-third of their preparation time, drafts returns with up to 97% accuracy, and lifts throughput by 50%. This frees capacity for higher-touch client interactions.
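The time-saving and throughput figures line up with simple arithmetic, assuming preparation time is the only constraint on throughput (an assumption for illustration, not a claim from the reporting). With $T$ as the original preparation time per return:

```latex
T' = \left(1 - \tfrac{1}{3}\right) T = \tfrac{2}{3}\,T,
\qquad
\frac{\text{throughput}'}{\text{throughput}} = \frac{T}{T'} = \frac{3}{2} = 1.5
```

Under that assumption, cutting one-third of the time per return corresponds to a 50% lift in returns completed per practitioner-hour.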

Performance tracked sharply upward throughout the season. Early iterations handled straightforward W-2 and 1099 forms. As the workload progressed, the agent tackled K-1s, multi-page schedules, and reconciliation edge cases. Each jump in complexity yielded greater time savings, since the marginal cost of manual work scales faster than the agent's processing overhead.
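One plausible reading of the headline figure is the share of returns whose field completion reaches a cutoff such as 75%. The sketch below computes that share; the function name, the input shape, and the sample numbers are hypothetical and not taken from Crete's system.

```python
def share_at_completion(returns, cutoff=0.75):
    """Fraction of returns whose completed-field ratio meets the cutoff.

    `returns` is a list of (fields_completed, fields_total) pairs,
    one pair per tax return. The data shape is an assumption.
    """
    if not returns:
        return 0.0
    hits = sum(
        1 for done, total in returns
        if total > 0 and done / total >= cutoff
    )
    return hits / len(returns)


# Made-up sample: two of four returns reach 75% field completion.
sample = [(3, 4), (1, 4), (4, 4), (2, 4)]
print(share_at_completion(sample))
```

Tracking this share week over week would show the kind of curve the season describes, although the real metric definition may differ.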

How the loop closes

Most AI deployments stall because the correction loop remains disconnected from the codebase. Engineers receive error logs, debug locally, and ship patches, but the system rarely accumulates durable knowledge. Crete faced the same trap initially: practitioners corrected errors, but the product lacked the context to distinguish a genuine extraction miss from a mapping gap or expected workflow noise.

The team resolved this by designing a three-pillar architecture:

Expert steering: Practitioners guide the learning trajectory. Their interventions flag which errors degrade the product versus which represent acceptable variance.
Production traces: Every interaction captures the full lineage: source files, extracted fields, provenance links, and downstream submissions. Corrections become structured evidence rather than silent losses.
Codex-driven iteration: Validated failures convert into bounded engineering tasks. Codex investigates the root cause, implements the fix, runs regression suites, and submits a pull request for review.
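To make the first two pillars concrete, here is a minimal sketch of how a trace record might carry both the lineage and the expert's judgment. Every name in it (FieldTrace, CorrectionKind, actionable) is hypothetical; the article does not describe the actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CorrectionKind(Enum):
    EXTRACTION_MISS = "extraction_miss"  # value was in the source but read wrongly
    MAPPING_GAP = "mapping_gap"          # value was read but routed to the wrong field
    WORKFLOW_NOISE = "workflow_noise"    # expected variance, not a product defect


@dataclass
class FieldTrace:
    field_name: str
    predicted: str                       # what the agent drafted
    filed: str                           # what was ultimately submitted
    source_file: str                     # provenance link back to the document
    kind: Optional[CorrectionKind] = None  # set by a practitioner during review

    @property
    def corrected(self) -> bool:
        return self.predicted != self.filed


def actionable(traces: List[FieldTrace]) -> List[FieldTrace]:
    """Keep only corrections an expert marked as genuine product failures."""
    real_failures = (CorrectionKind.EXTRACTION_MISS, CorrectionKind.MAPPING_GAP)
    return [t for t in traces if t.corrected and t.kind in real_failures]
```

The design point is that the expert label travels with the trace, so downstream tooling can separate defects from acceptable variance without re-asking the practitioner.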

Consider a rental property scenario. Income flows through Schedule E, requiring the system to parse handwritten notes, email attachments, and spreadsheets. When the agent mispredicts a "fair rental days" field, the practitioner overrides the value. The system compares the predicted value against the filed return, groups similar deviations, and isolates the actionable failure.
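The compare-and-group step can be sketched as follows, assuming each record pairs a predicted value with the filed value. The tuple layout, the function names, and the recurrence cutoff are illustrative assumptions rather than details from the Crete system.

```python
from collections import defaultdict


def group_deviations(records):
    """Group returns by the field where prediction and filing disagree.

    `records` is an iterable of (return_id, field_name, predicted, filed)
    tuples; this layout is hypothetical.
    """
    groups = defaultdict(list)
    for return_id, field_name, predicted, filed in records:
        if predicted != filed:
            groups[field_name].append(return_id)
    return dict(groups)


def recurring(groups, min_returns=2):
    """Keep fields that deviate on at least `min_returns` distinct returns."""
    return {
        name: ids for name, ids in groups.items()
        if len(set(ids)) >= min_returns
    }


records = [
    ("r1", "fair_rental_days", "300", "365"),
    ("r2", "fair_rental_days", "120", "150"),
    ("r3", "fair_rental_days", "365", "365"),
    ("r3", "rents_received", "1000", "1200"),
]
print(recurring(group_deviations(records)))
```

A one-off override stays out of the task queue, while a field that is overridden across several returns surfaces as a candidate failure.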

The finding transforms into a scoped task. Codex receives a repository branch containing the affected modules, a YAML-defined eval suite specifying success conditions, and a read-only snapshot of the production trace. The agent inspects the extraction schema, the mapper logic, and the grading criteria. It identifies whether the bug stems from a missing regex pattern, a source-selection flaw, or a mapper collision. After proposing a patch, Codex reruns the eval. If the metric clears the threshold, the change lands. Ambiguous cases route back to the engineering team, preventing false positives from polluting the codebase.
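The gating logic at the end of that loop might look like the sketch below. The threshold, the margin that defines an "ambiguous" result, and the outcome names are all assumptions made for illustration; the article does not say how the eval suite encodes its success conditions.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: int       # eval cases the patched branch gets right
    total: int        # eval cases in the suite
    regressions: int  # cases that passed before the patch and fail now

    @property
    def score(self) -> float:
        return self.passed / self.total if self.total else 0.0


def decide(result: EvalResult, threshold: float = 0.95, margin: float = 0.02) -> str:
    """Return 'land', 'escalate', or 'reject' for a proposed patch.

    Threshold and margin are placeholder values, not Crete's settings.
    """
    if result.regressions > 0:
        return "escalate"  # the fix broke something else: a human decides
    if result.score >= threshold:
        return "land"      # clears the bar: open the pull request
    if result.score >= threshold - margin:
        return "escalate"  # near miss: treated here as the ambiguous case
    return "reject"


print(decide(EvalResult(passed=96, total=100, regressions=0)))
```

Treating any regression as an escalation, even when the score clears the bar, is one conservative way to keep a locally successful patch from degrading other fields.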

This structure mirrors the constraints discussed in the piece on autonomous coding hitting the governance wall: velocity requires tight bounds; otherwise the agent drifts into unverified territory. The Crete implementation demonstrates that when the evals are grounded in production reality, the boundary holds.

Our read

The headline statistic, the leap from 25% to 86%, is impressive, but the structural insight matters more. Self-improving agents are simple in theory; in practice, they collapse under the weight of noisy feedback and brittle pipelines. The breakthrough here is treating practitioner corrections as high-fidelity training data rather than customer support tickets.

Two factors enabled this loop. First, the product surface is tightly bounded. The agent handles extraction and mapping; humans handle tax judgment and final approval. This division of labor keeps the agent's action space small and verifiable. Second, Thrive Holdings operates as both owner and operator. Vendors typically struggle to secure the raw data needed to train specialized models, caught in procurement delays and data-sharing agreements. By embedding engineering teams inside the operating company, Thrive eliminated the friction. Feedback travels instantly from the front desk to the repo.

The blueprint extends beyond tax. The same loop applies to bookkeeping, audit workflows, and IT helpdesk automation. Any domain where experts perform repetitive, document-heavy tasks with predictable outcomes is ripe for this architecture. The agents won't eliminate the workforce; they will force a redefinition of scope. Senior staff who previously spent 180 hours on manual prep reduced that load to 15 hours, reallocating time to client calls and new service offerings. The margin lives in the transition from execution to advisory.

The race shifts from building bigger models to building tighter loops. Systems that treat production as a continuous stream of labeled data will outpace those treating AI as a batch optimization exercise.

Reporting from OpenAI.

Related reading

Cognition Raises $1B at $26B Valuation as Agents Replace Copilots

Autonomous coding ships fast. Governance breaks the loop.

Understand Anything Turns Codebases and Knowledge Bases Into Interactive Graphs