The Pit of Success - Then And Now

"The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks. To the extent that we make it easy to get into trouble we fail." — Rico Mariani

An old chestnut, but like many, a true one. A well-designed system makes the correct path the easy path, and guides you back to it when you stray. You don't ask people to be perfect all the time; you shape the environment so that the path of least resistance takes you where you need to go. In fact, since we don't want you to backtrack along the path, it slopes down, and the sides slope up... and we've got ourselves a pit.

So you dig a pit and users fall into success.

But digging a pit for a software tool is hard! To abuse a metaphor: you don't know the full landscape your users are wandering through; you might not put the pit in the right place, or they get caught in a local minimum far from the big pit you want. And users are truly awful at describing why they got stuck, don't want invasive telemetry, and have better things to do than sit through your user studies.

Aside

We have deep respect for the product managers who could get the instrumentation, user interviews, coworking sessions, and all that to understand where people got stuck and why. As engineers, we usually just built tools that solved problems we had, because at least product discovery was easy.

Agents - Infinite Users

Agentic software development can be frustrating. Workflows are probabilistic; agents have no ability to introspect or explain themselves. But from a UX perspective, they're a blessing.

If you build a tool for an AI, it always reads the documentation you gave it. You can make it run as many reps as you'd like. And if you run into a case of PEBGAA - Problem Exists Between GPU And API - you can swap out your user for a better one for a "modest" fee! (And your cheap users just keep getting better - thank you DeepSeek!)

"Bad" UX for agents is now solvable the same way all software problems are: measure and iterate. And that process turns out to look a lot more like a traditional SDLC: because your users are software too, you can run UX integration tests, regression tests, and unit tests.

Tips

This is actually fantastically exciting for people too! Most things an agent trips on, a human would hit too, and most open-source tools don't have infinite UX testing, so a reliable stand-in that can suss out the issues is a great proxy for better human UX. Agents aren't people, but they get wedged in the same corners.

How we think about success

Before you can iterate, you need to know what you're optimizing. When it comes to an agent doing a thing, we typically care about three numbers:

Success rate - did it get the right answer?
Wall clock - how long did it take?
Expense - how much did the answer cost?

These are interrelated in emergent ways. More tokens usually means both higher latency and higher cost, so wall clock and expense move together. Tool calls can dominate wall clock or be negligible depending on what they touch - a near-instant SQLite read versus an expensive BigQuery round trip.
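As a minimal sketch of tracking these three numbers together, the snippet below collapses a batch of runs into one summary. The `Run` record and the `summarize` helper are hypothetical names chosen for illustration, not part of any particular harness, and using the median for wall clock and the mean for cost is an assumption, not a rule.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One agent attempt at a task (hypothetical record shape)."""
    success: bool     # did it get the right answer?
    seconds: float    # wall clock for the whole attempt
    cost_usd: float   # model spend plus any metered tool calls


def summarize(runs: list[Run]) -> dict[str, float]:
    """Collapse a batch of runs into the three headline numbers."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "success_rate": sum(r.success for r in runs) / len(runs),
        "median_seconds": median(r.seconds for r in runs),
        "mean_cost_usd": sum(r.cost_usd for r in runs) / len(runs),
    }
```

Reporting all three from the same batch makes trade-offs visible: a change that lifts success rate while doubling cost shows up in one place.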

And the bars are higher than they look. Success rates below the mid-to-high nineties are incredibly frustrating; they demo well and fall apart in practice. Wall clock matters for anything with a human in the loop. And expense is a deal-breaker - outside corporate environments, people expect software to be cheap, and slapping a $15 bill on each use of your tool is not welcoming.

Hitting those bars on a probabilistic system means measuring a lot of samples - so you want scalable, repeatable tests before anything else. Did we mention cheap?
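To put a rough number on "a lot of samples", here is a sketch that computes a Wilson score interval for an observed pass rate. The Wilson interval is a standard choice for binomial proportions; the function name and the 95% confidence level (`z=1.96`) are assumptions made for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed 95% pass rate, very different certainty.
for passed, total in [(19, 20), (950, 1000)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total}: {low:.1%} to {high:.1%}")
```

At 20 runs, an observed 95% pass rate is consistent with a true rate anywhere from roughly 76% to 99%; at 1,000 runs the interval narrows to roughly 93% to 96%. A handful of runs cannot tell a tool that works from one that merely demos well.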

The levers

Once you can measure, you have three levers to pull to improve how much signal you get:

Model selection. Dumb, but not too dumb. A smarter model papers over harness and tool deficiencies - often at the cost of tokens - and hides the signal on what's actually wrong. Test on the least capable model that can still do the simplest version of your task ~99% of the time. You don't want to fight the model for the basics, but you don't want it rescuing you either.
Harness. The prompt, the toolset, and the lifecycle loop around the agent. Its job is mostly to get out of the way: expose the minimal sufficient set of tools, compel the agent to move forward, and protect it from context overload and irrelevant noise.
Agentic UX. What you came for, and ultimately a context-management problem. Give the agent just enough to get started, then feed it the right correction in the most targeted dose possible - through better defaults, error messages that suggest the next step, and progressive disclosure. This is the actual UX surface that you can tweak; the other two are just there to make sure you have high signal.
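As one illustration of an error message that suggests the next step, the sketch below resolves a field name and, on a miss, replies with the closest known names instead of a bare failure. Everything here is hypothetical: `resolve_field`, `ToolError`, and the `list_fields` tool named in the fallback hint are stand-ins for illustration, not Trilogy's API. Only `difflib.get_close_matches` is a real standard-library call.

```python
import difflib


class ToolError(Exception):
    """Error whose message is written for the agent to act on."""


def resolve_field(name: str, known_fields: list[str]) -> str:
    """Return the field if it exists; otherwise fail with a targeted hint."""
    if name in known_fields:
        return name
    # Offer at most three near matches so the correction stays small.
    guesses = difflib.get_close_matches(name, known_fields, n=3, cutoff=0.6)
    if guesses:
        hint = "Did you mean: " + ", ".join(guesses) + "?"
    else:
        # Hypothetical discovery tool the agent could call next.
        hint = "Call list_fields to see what is available."
    raise ToolError(f"Unknown field '{name}'. {hint}")
```

The design choice is to spend a few tokens on the most likely fix rather than dumping the whole schema into context, which keeps the correction targeted.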

Each of these deserves its own treatment, and the interesting part is always in the specifics: which tool, which error message, which default. The right answer is rarely intuitive, which is exactly why you measure.

In Practice

We took our own tool, Trilogy, from a 25% agentic success rate to 80% - while cutting token usage - by doing nothing more than measuring and pulling these levers one at a time.

That story has real numbers, a few humbling surprises (our first hand-curated semantic model was worse than the auto-generated one), and a hard lesson about how artificial benchmarks can mislead you. We tell it in the companion post: Digging the Pit of Success for TPC-DS.

You don't get a tool people succeed with without understanding why they fail. With agents, you get reproducible signal. And that lets you dig the pit you've always wanted.