Digging the Pit Of Success for TPC-DS
Digging the Pit Of Success for TPC-DS
In a companion post we argued that building tools for agents is, finally, a tractable UX problem: agents always read the manual, run as many reps as you'd like, and never lie about what they did. "Bad" UX for agents is solvable the same way all software is - measure and iterate. We laid out what to measure (success rate, wall clock, expense) and the three levers to pull (model selection, harness, tool performance).
This is the post where we actually pull them. Here's how we took Trilogy from a 25% agentic success rate to 80%, while cutting token usage - and what TPC-DS had to do with both the progress and the pain.
First, some context
Trilogy is SQL without the tables: you write SQL against a metadata layer that late-binds to physical tables to fulfill your query. This has a lot of nice properties, most of which seem tailor-made for agents:
- Rich context by default.
- No fan-out and chasm traps.
- Automatic aggregate resolution.
- Strong typing and linting for fast feedback.
- All the power and expressiveness of SQL - which agents know like the back of their hand.
It's a no-brainer that it's better than raw SQL for agents. We just had to prove it.
Aside
Trilogy has existed since before agents were a thing. It turns out many of the properties that make it better for agents also make it better for humans - and humans are still a primary target.
So we set up a scrappy test harness, ran our first round of evals, and got a resounding… 25% success rate. Give the same harness raw SQL access and no Trilogy, and it scored 50%. More tokens, worse output, more time - exactly the opposite of our naive expectations.
What went wrong? To answer that, we had to pull the levers one at a time.
Lever 1: Model selection
The temptation is to reach for the smartest model available. Resist it. A smarter model papers over harness and tool deficiencies - usually by spending tokens - and in doing so it hides the very signal you're trying to read. If the agent silently recovers from a bad error message, you never learn the error message was bad.
So we test on the model that can just do the simplest version of the task ~99% of the time. Smart enough not to fight us on the basics but not so smart it rescues us from our own sharp edges. Agent-level recovery from deficiencies is exactly what we're trying to measure and remove, not benefit from.
Historical note
We're blessed to be past the days of fighting agents to output properly formatted JSON for tool calls. Basics covers a lot more than it used to, which means you can focus on higher level optimizatio.
Lever 2: The harness
The harness is just the prompt, the toolset, and the lifecycle loop. Its goal is to get out of the way. Ours had to do three things:
- Expose the minimal sufficient set of tools. Every extra tool is another way for the agent to branch. If a problem can be solved multiple ways, the agent will try multiple ways over multiple iterations, and your optimization signal inherits that noise. We focused on one golden path with tight iterative loops over the minimal toolset (pruned over time as we proved) that less tools didn't hurt performance.
- Compel the agent forward. We give the agent an explicit tool to call when it's "done." Any response that isn't that call gets a nudge from the harness to continue.
- Protect the agent. Avoid context overload and huge text dumps, filter out irrelevant info, and handle retries and backoff so the agent never has to.
If you're feeling fancy, this is where multi-agent patterns live - dedicated sub-agent flows for context-hungry tasks, a review step because "done" isn't always done, and other creative structures. We generally prefer to fix the root cause upstream: shrink the context a task needs and make the tools push toward correct solutions, rather than bolting on a second agent to verify the first. Every probabilistic step you add is another place for frustration to compound. Your mileage may vary; there are genuine cases for fan-out workflows, and when you hit them you'll want real harness separation between coordinator and worker agents. Come back here after you've gone full Gastown.
Lever 3: Tool performance
For tools, the whole game is context management: give the agent just enough to get started, then feed it the right correction in the most targeted dose possible. The major levers:
- Better defaults.
- Error messages that suggest the next step.
- Progressive disclosure.
- Making sure the information needed to progress exists somewhere.
- Agent-friendly formatting.
A few things that were a genuine struggle - mostly involving letting agents tune their own tools:
- Agents massively overindex on the case in front of them. Let one tune a tool and you get a mishmash of hyper-specialized guidance bolted on for one query it just saw.
- They are poor diagnosticians. Agents can do reasonable trajectory analysis under close supervision, but accepting their default root-cause analysis without manually confirming is a recipe for sadness. They tend to speculate withotu confirming.
- They are not objective oracles Claude, for instance, tends to assume a task failure means the model underneath is bad (it was quite sure DeepSeek was the problem), rather than investigating - when on our relatively straightforward tasks, the model is rarely the actual culprit.
So why did we score so low?
With the levers in hand, the diagnosis came into focus. Our pit of success was shallow and hard to find.
- No worn path to the pit. We gave agents no concrete examples of complex syntax, so they were unlikely to luck into it whenever it varied even slightly from SQL.
- Walls that didn't slope inward. When an agent hit an error, the message didn't reliably guide it toward a fix - so it would misdiagnose the problem and spiral.
- An untested API surface. We had an enormous surface area that had never been pressure-tested against the ways a query can be mis-written. It's hard to imagine every wrong path when you already know the language. Agents were tremendously good at finding the unexpected sharp corners - and genuine bugs.
- Ironically, bad human modeling. With Trilogy we ship a semantic model to the agent and discourage it from poking at the raw database. So when the model had bad context, the agent made bad choices. Our auto-generated semantic model was initially better than our hand-curated one - humans had simply been lazy. Patching up the hand-curated model closed that gap and was a key to finally beating raw SQL.
And then there was the test itself
The last culprit was the benchmark. Our evals were built on the TPC-DS suite - the same one we use to verify the language generates correct, performant SQL. It's a fantastic tool for that: a hard schema you didn't design, reference outputs you can diff against, and hand-written SQL as a performance bar.
But TPC-DS is, in many ways, deeply artificial. Some queries have row duplication, weird joins, and cryptic intent (Query 99 makes "count of shipped orders" genuinely ambiguous - orders, or orders-times-items?). A human analyst shrugs and picks a reading. An agent spends a lot of tokens probing to understand why it's getting odd results - because it assumes the oddity is its own fault. The benchmark that was perfect for stress-testing the language isn't a perfect for the actual cases where we think we'd do better than raw SQL.
That's the trap worth remembering: a benchmark is a pit you dig for yourself, and the shape of the pit decides what you learn. TPC-DS digs a deep, honest pit for language correctness - and a slightly crooked one for agent success. Knowing which is which is half the battle.
Tips
We're working on additional benchmarks that better represent a messy warehouse. Hex has some fantastic descriptions of how they needed to create a synthetic benchmark
Back to better
So we climbed. Examples of complex syntax, so the path to the pit was worn and obvious. Error messages that suggested the next step, so the walls sloped inward. A pressure-tested API and a patched-up semantic model, so the context the agent stood on was solid. And evals read with an eye for which failures were Trilogy's fault and which were TPC-DS being TPC-DS.
The result: 25% to 80%, with lower token usage - now comfortably ahead of raw SQL instead of behind it. No single heroic fix; just measure, find the lever, pull it, measure again.
We're not at the mid-to-high nineties we ultimately want yet. But the pit is deeper and easier to find than it was, and every eval tells us exactly where to dig next. Because that's the only place the pit goes.