OpenAI and Thrive Holdings fused practitioner feedback with Codex to build a tax agent that auto-improved over six months. Processing 7,000 returns, the system raised field completion from 25% to 86% in six weeks.
Over six months, a collaboration between OpenAI and Thrive Holdings turned Codex into a self-improving agent that processed 7,000 tax returns for Crete. Within six weeks, the share of returns the system handled at 75% field completion jumped from 25% to 86%.
The feedback loop shifted from a manual engineering chore to an automated improvement engine.
The metrics of self-improvement
Crete's network of accounting firms prepares tens of thousands of returns annually. For medium- to large-complexity filings, data entry consumes eight hours per return, driven by messy source documents, prior-year records, and manual extraction.
Tax AI automates the preparation of 1040 and 1041 returns. The system saves practitioners approximately one-third of their preparation time, drafts returns with up to 97% accuracy, and lifts throughput by 50%. This frees capacity for higher-touch client interactions.
Performance tracked sharply upward throughout the season. Early iterations handled straightforward W-2 and 1099 forms. As the workload progressed, the agent tackled K-1s, multi-page schedules, and reconciliation edge cases. Each jump in complexity yielded greater time savings, since the marginal cost of manual work scales faster than the agent's processing overhead.
How the loop closes
Most AI deployments stall because the correction loop remains disconnected from the codebase. Engineers receive error logs, debug locally, and ship patches, but the system rarely accumulates durable knowledge. Crete faced the same trap initially: practitioners corrected errors, but the product lacked the context to distinguish a genuine extraction miss from a mapping gap or expected workflow noise.
The team resolved this by designing a three-pillar architecture:
Expert steering: Practitioners guide the learning trajectory. Their interventions flag which errors degrade the product versus which represent acceptable variance.
Production traces: Every interaction captures the full lineage: source files, extracted fields, provenance links, and downstream submissions. Corrections become structured evidence rather than silent losses.
Codex-driven iteration: Validated failures convert into bounded engineering tasks. Codex investigates the root cause, implements the fix, runs regression suites, and submits a pull request for review.
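To make the "production traces" pillar concrete, here is a minimal Python sketch of what such a record might look like. It is an illustration under stated assumptions, not Crete's schema: the class names, the field names, and the dotted field key are all hypothetical. The point it shows is that once each predicted value carries its provenance, a practitioner override can be derived as structured evidence instead of being lost.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtractedField:
    """One value the agent predicted, with where it came from (hypothetical shape)."""
    name: str         # e.g. "schedule_e.fair_rental_days" (illustrative key)
    predicted: str
    source_file: str  # provenance: the document the value was read from
    locator: str      # provenance: page, cell, or region inside that document


@dataclass(frozen=True)
class Correction:
    """A practitioner override, kept together with the provenance of the miss."""
    return_id: str
    field_name: str
    predicted: str
    filed: str
    source_file: str
    locator: str


@dataclass
class ProductionTrace:
    """Lineage for one return: sources, extracted fields, and what was filed."""
    return_id: str
    source_files: list[str]
    extracted: dict[str, ExtractedField] = field(default_factory=dict)
    filed: dict[str, str] = field(default_factory=dict)

    def record_extraction(self, item: ExtractedField) -> None:
        self.extracted[item.name] = item

    def record_filed_value(self, name: str, value: str) -> None:
        self.filed[name] = value

    def corrections(self) -> list[Correction]:
        """Return every field whose filed value differs from the prediction."""
        found = []
        for name, item in self.extracted.items():
            filed_value = self.filed.get(name)
            if filed_value is not None and filed_value != item.predicted:
                found.append(Correction(self.return_id, name, item.predicted,
                                        filed_value, item.source_file, item.locator))
        return found
```

In this sketch the corrections are computed from the trace rather than stored separately, so the evidence cannot drift away from the source documents it refers to.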
Consider a rental property scenario. Income flows through Schedule E, requiring the system to parse handwritten notes, email attachments, and spreadsheets. When the agent mispredicts a "fair rental days" field, the practitioner overrides the value. The system compares the predicted value against the filed return, groups similar deviations, and isolates the actionable failure.
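The compare-and-group step described above can be sketched in a few lines of Python. This is a hedged illustration, not the production logic: the `Deviation` record, the grouping key (field plus source type), and the `min_count` cutoff are assumptions chosen to show how a single override differs from a recurring, actionable failure.

```python
from collections import defaultdict
from typing import NamedTuple


class Deviation(NamedTuple):
    """Predicted versus filed value for one field on one return (hypothetical shape)."""
    return_id: str
    field_name: str
    source_kind: str  # e.g. "handwritten_note", "spreadsheet" (illustrative labels)
    predicted: str
    filed: str


def group_deviations(records, min_count=2):
    """Group mismatches by (field, source kind) and keep the recurring ones.

    min_count is an assumed cutoff: a one-off override may be workflow noise,
    while a repeated one points at a fixable extraction or mapping problem.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec.predicted != rec.filed:  # only genuine overrides count
            groups[(rec.field_name, rec.source_kind)].append(rec)
    recurring = {key: recs for key, recs in groups.items() if len(recs) >= min_count}
    # Largest groups first, so the most common failure is triaged first.
    return dict(sorted(recurring.items(), key=lambda kv: len(kv[1]), reverse=True))


records = [
    Deviation("R-1", "fair_rental_days", "handwritten_note", "300", "330"),
    Deviation("R-2", "fair_rental_days", "handwritten_note", "12", "365"),
    Deviation("R-3", "fair_rental_days", "spreadsheet", "365", "365"),
    Deviation("R-4", "rents_received", "spreadsheet", "1200", "12000"),
]
actionable = group_deviations(records)
```

With these sample records, only the handwritten-note group for `fair_rental_days` recurs; the single `rents_received` override falls below the assumed cutoff and the matching prediction is ignored.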
The finding transforms into a scoped task. Codex receives a repository branch containing the affected modules, a YAML-defined eval suite specifying success conditions, and a read-only snapshot of the production trace. The agent inspects the extraction schema, the mapper logic, and the grading criteria. It identifies whether the bug stems from a missing regex pattern, a source-selection flaw, or a mapper collision. After proposing a patch, Codex reruns the eval. If the metric clears the threshold, the change lands. Ambiguous cases route back to the engineering team, preventing false positives from polluting the codebase.
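The gating step at the end of that workflow might look like the following Python sketch. The article only says that a patch lands when the metric clears the threshold and that ambiguous cases go back to engineers; the specific threshold, the "near miss" margin, the regression check, and the separate reject outcome are assumptions added here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    LAND = "land"          # metric clears the threshold cleanly
    ESCALATE = "escalate"  # ambiguous: route back to the engineering team
    REJECT = "reject"      # clearly below the bar (assumed outcome, not from the article)


@dataclass(frozen=True)
class EvalResult:
    passed: int
    total: int
    regressions: int  # cases that passed before the patch and fail after it

    @property
    def score(self) -> float:
        return self.passed / self.total if self.total else 0.0


def gate(result: EvalResult, threshold: float = 0.95, margin: float = 0.02) -> Decision:
    """Decide what happens to a proposed patch after the eval rerun.

    threshold and margin are hypothetical values, not figures from the article.
    """
    if result.total == 0:
        return Decision.ESCALATE  # no evidence either way
    if result.score >= threshold:
        # A passing score that breaks previously passing cases is ambiguous.
        return Decision.ESCALATE if result.regressions else Decision.LAND
    if result.score >= threshold - margin:
        return Decision.ESCALATE  # near miss: a human should look
    return Decision.REJECT
```

Keeping the escalation band explicit is one way to encode the idea that an agent should not resolve ambiguity on its own: anything that is neither a clean pass nor a clear failure reaches a person.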
This structure mirrors the constraints discussed in "Autonomous coding hits the governance wall": velocity requires tight bounds, otherwise the agent drifts into unverified territory. The Crete implementation demonstrates that when the evals are grounded in production reality, the boundary holds.
Our read
The headline statistic, the leap from 25% to 86% completion, is impressive, but the structural insight matters more. Self-improving agents are theoretically trivial; practically, they collapse under the weight of noisy feedback and brittle pipelines. The breakthrough here is treating practitioner corrections as high-fidelity training data rather than customer support tickets.
Two factors enabled this loop. First, the product surface is tightly bounded. The agent handles extraction and mapping; humans handle tax judgment and final approval. This division of labor keeps the agent's action space small and verifiable. Second, Thrive Holdings operates as both owner and operator. Vendors typically struggle to secure the raw data needed to train specialized models, caught in procurement delays and data-sharing agreements. By embedding engineering teams inside the operating company, Thrive eliminated the friction. Feedback travels instantly from the front desk to the repo.
The blueprint extends beyond tax. The same loop applies to bookkeeping, audit workflows, and IT helpdesk automation. Any domain where experts perform repetitive, document-heavy tasks with predictable outcomes is ripe for this architecture. The agents won't eliminate the workforce; they will force a redefinition of scope. Senior staff who previously spent 180 hours on manual prep reduced that load to 15 hours, reallocating time to client calls and new service offerings. The margin lives in the transition from execution to advisory.
The race shifts from building bigger models to building tighter loops. Systems that treat production as a continuous stream of labeled data will outpace those treating AI as a batch optimization exercise.
Tightening the feedback loop between production usage and code iteration delivers more reliable ROI than simply scaling model size.
Stance · Bullish
Confidence · Emerging
The piece validates a scalable architectural pattern that solves persistent AI deployment bottlenecks with documented efficiency gains and clear expansion potential.
Key takeaways
A three-pillar architecture converting practitioner corrections into structured production traces enables rapid performance gains, lifting field completion from 25 percent to 86 percent in six weeks.
Strict role separation keeps the agent’s action space small and verifiable, with AI handling extraction and mapping while humans retain final judgment.
Embedding engineering teams directly inside the operating company removes data-sharing friction, allowing instantaneous feedback transmission from frontline practice to the repository.
Competitive advantage is shifting from developing larger foundational models to building tighter, production-grounded evaluation loops.
What to watch next
Adoption velocity in adjacent document-heavy workflows like bookkeeping and audit
Development of standardized validation protocols for production-trace feedback loops
Evolution of governance frameworks for autonomous code submission in regulated sectors
Who should care
AI product managers
Enterprise ops leaders
Software engineers
Professional services firms
Key players
OpenAI
Thrive Holdings
Crete
Codex
Auto-generated from the article by our model — a reading aid, not a replacement for the piece.