Digging the Pit Of Success for Agents
"The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks. To the extent that we make it easy to get into trouble we fail." — Rico Mariani
The idea is old, and it's a good one. A well-designed system makes the correct thing the easy thing. You don't ask people to be careful, disciplined, or expert; you shape the environment so that the path of least resistance is also the path to a working result. You dig a pit, you make sure the paths lead to it and the walls slope inward, and users fall into success.
But engineering tool experiences for humans was hard. Digging the pit was a slog, and users would find endlessly surprising ways to avoid falling into it - or claw their way back out.
Aside
We have deep respect for the product managers who put in the instrumentation, user interviews, coworking sessions, and more to truly understand where people got stuck and why. As engineers, we usually found it easier to build a tool that shortcut that process than to do the real work of understanding the user.
Agents flipped the script
Agentic software development can be frustrating. Workflows are probabilistic; agents have no ability to introspect or explain themselves. But from a UX perspective, they're a blessing.
If you build a tool for an AI, it always reads the manual. You can make it run as many reps as you'd like. And if you run into a case of PEBKAC - problem exists between keyboard and chair, or between GPU and API - you can swap out your user for a better one for a modest fee.
"Bad" UX for agents is now solvable the same way all software problems are: measure and iterate. And that process turns out to look a lot more like a traditional SDLC, because the new users can run endless repetitions and are completely auditable.
What success looks like
Before you can iterate, you need to know what you're optimizing. When it comes to an agent doing a thing, we typically care about three numbers:
- Success rate - did it get the right answer?
- Wall clock - how long did it take?
- Expense - how much did the answer cost?
These are interrelated in emergent ways. More tokens usually means both higher latency and higher cost, so wall clock and expense move together. Tool calls can dominate wall clock or be negligible depending on what they touch - a near-instant SQLite read versus an expensive BigQuery round trip.
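To make the three numbers concrete, here is a minimal sketch of recording them per run and rolling them up. The `Trial` record and `summarize` helper are hypothetical names made up for illustration, not part of any particular harness, and median wall clock with mean cost is just one reasonable choice of aggregate.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trial:
    """One agent run against one task (hypothetical record shape)."""
    success: bool        # did it get the right answer?
    wall_clock_s: float  # how long did it take, in seconds?
    cost_usd: float      # how much did the answer cost?


def summarize(trials):
    """Collapse a batch of trials into the three headline numbers."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    return {
        "success_rate": sum(t.success for t in trials) / n,
        # Median is less sensitive than the mean to one slow tool call.
        "median_wall_clock_s": median(t.wall_clock_s for t in trials),
        "mean_cost_usd": sum(t.cost_usd for t in trials) / n,
    }


if __name__ == "__main__":
    batch = [
        Trial(True, 1.0, 0.02),
        Trial(False, 3.0, 0.04),
        Trial(True, 2.0, 0.03),
        Trial(True, 4.0, 0.03),
    ]
    print(summarize(batch))
```

Tracking all three per trial, rather than only in aggregate, makes it possible to see afterwards whether the expensive runs were also the slow ones.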
And the bars are higher than they look. Success rates below the mid-to-high nineties are incredibly frustrating; they demo well and fall apart in practice. Wall clock matters for anything with a human in the loop. And expense is a deal-breaker - outside corporate environments, people expect software to be cheap, and slapping a $15 bill on each use of your tool is not welcoming.
Hitting those bars on a probabilistic system means measuring a lot of samples - so you want scalable, repeatable tests before anything else.
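As a rough illustration of why "a lot of samples" matters, the sketch below computes a Wilson score interval for an observed success rate. This is a standard statistical technique swapped in for illustration, not something the measurement process above prescribes, and the 95% level (`z=1.96`) is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same observed 95% success rate, very different certainty.
    for successes, n in [(19, 20), (190, 200)]:
        low, high = wilson_interval(successes, n)
        print(f"{successes}/{n}: {low:.3f} to {high:.3f}")
```

With 19 successes out of 20, the interval runs from roughly 0.76 to 0.99, so a run that looks like 95% is still consistent with a tool that fails about one time in four. At 190 out of 200 the lower bound rises to roughly 0.91.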
The levers
Once you can measure, you have three levers to pull:
- Model selection. Dumb, but not too dumb. A smarter model papers over harness and tool deficiencies - often at the cost of tokens - and hides the signal on what's actually wrong. Test on the least capable model that can still do the simplest version of your task ~99% of the time. You don't want to fight the model for the basics, but you don't want it rescuing you either.
- Harness. The prompt, the toolset, and the lifecycle loop around the agent. Its job is mostly to get out of the way: expose the minimal sufficient set of tools, compel the agent to move forward, and protect it from context overload and irrelevant noise.
- Tool performance. Fundamentally a context-management problem. Give the agent just enough to get started, then feed it the right correction in the most targeted dose possible - through better defaults, error messages that suggest the next step, and progressive disclosure.
Each of these deserves its own treatment, and the interesting part is always in the specifics: which tool, which error message, which default. The right answer is rarely intuitive, which is exactly why you measure.
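As one small illustration of "error messages that suggest the next step", here is a sketch of a tool-side lookup that answers a bad field name with the closest known names instead of a bare failure. Everything here is hypothetical: `ToolError`, `resolve_field`, the field names, and the `list_fields` tool mentioned in the fallback hint are invented for the example and are not Trilogy's API.

```python
import difflib


class ToolError(Exception):
    """Error whose message is written for the agent to act on."""


def resolve_field(name, known_fields):
    """Return the field if it exists, else raise with a targeted hint."""
    if name in known_fields:
        return name
    # Fuzzy-match against known names so the correction is one step away.
    suggestions = difflib.get_close_matches(name, known_fields, n=3, cutoff=0.6)
    if suggestions:
        hint = "Did you mean: " + ", ".join(suggestions) + "?"
    else:
        # 'list_fields' is an assumed companion tool, named for illustration.
        hint = "Call list_fields to see the available fields."
    raise ToolError(f"Unknown field '{name}'. {hint}")


if __name__ == "__main__":
    fields = ["order_total", "order_date", "customer_id"]
    try:
        resolve_field("order_totl", fields)
    except ToolError as err:
        print(err)
```

The message carries only the few candidates most likely to be right, which keeps the correction small in context. Whether a hint like this actually improves success rate for a given tool is something to measure, not assume.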
A worked example
None of this is hypothetical. We took our own tool, Trilogy, from a 25% agentic success rate to 80% - while cutting token usage - by doing nothing more than measuring and pulling these levers one at a time.
That story has real numbers, a few humbling surprises (our hand-curated semantic model was worse than the auto-generated one), and a hard lesson about how artificial benchmarks can mislead you. We tell it in the companion post: Digging the Pit of Success for TPC-DS.
The short version is the same as the long one. You don't get a tool agents succeed with by being clever once. You dig the pit - measure, iterate, and make the easy path the correct one - and then you let them fall in.