Digging the Pit Of Success for TPC-DS

In a companion post we argued that building tools for agents is, finally, a tractable UX problem: agents always read the manual, run as many reps as you'd like, and never lie about what they did. "Bad" UX for agents is solvable the same way all software is - measure and iterate. We laid out what to measure (success rate, wall clock, expense) and the three levers to pull (model selection, harness, tool performance).

This is our shame post about how we needed to pull those levers to get our language to be as good as we thought it was. A lot of hubris and some lessons along the way - taking Trilogy from a 25% agentic success rate to 80%, while cutting token usage - and what TPC-DS had to do with both the progress and the pain.

First, some context

Trilogy is SQL without the tables: you write SQL against a metadata layer that late-binds to physical tables to fulfill your query. This has a lot of nice properties, most of which seem tailor-made for agents:

Rich context by default, with tools for progressive disclosure.
No fan-out and chasm traps.
Automatic aggregate resolution.
Strong typing and linting for fast feedback.
All the power and expressiveness of SQL - which agents know like the back of their hand.

It's a no-brainer that it's better than raw SQL for agents. Every semantic layer blog said so, and we agreed with them. We just had to prove it.

Aside

Trilogy has existed since before agents were a thing. It turns out many of the properties that make it better for agents also make it better for humans - and humans are still a primary target.

So we set up a scrappy test harness, ran our first round of evals, and got a resounding… 25% success rate. Give the same harness raw SQL access and no Trilogy, and it scored 50%. More tokens, worse output, more time - exactly the opposite of our naive expectations.

What went wrong? To answer that, we had to follow the steps - find our agent proxy; fix our harness, and last make the tool work.

Lever 1: Model selection

The temptation is to reach for the smartest model available, because that makes your tool look great. It's bad for optimization, though - a smarter model papers over harness and tool deficiencies and in doing so it hides the very signal you're trying to read. If the agent silently recovers from a bad error message, you never learn the error message was bad.

So we test on the model that can just do the simplest version of the task ~99% of the time. Smart enough not to fight us on the basics but not so smart it rescues us from our own sharp edges. Agent-level recovery from deficiencies is exactly what we're trying to measure and remove, not benefit from.

Historical note

We're blessed to be past the days of fighting agents to output properly formatted JSON for tool calls. Basics covers a lot more than it used to, which means you can focus on higher level optimization.

Lever 2: The harness

The harness is just the prompt, the toolset, and the lifecycle loop. Ours had to do three things:

Expose the minimal sufficient set of tools. Every extra tool is another way for the agent to branch. If a problem can be solved multiple ways, the agent will try multiple ways over multiple iterations, and your optimization signal inherits that noise. We focused on one golden path with tight iterative loops over the minimal toolset (pruned over time as we proved) that less tools didn't hurt performance.
Compel the agent forward. We give the agent an explicit tool to call when it's "done." Any response that isn't that call gets a nudge from the harness to continue.
Protect the agent. Avoid context overload and huge text dumps, filter out irrelevant info, and handle retries and backoff so the agent never has to.

If you're feeling fancy, this is where multi-agent patterns live - dedicated sub-agent flows for context-hungry tasks, a review step because "done" isn't always done, and other creative structures. We generally prefer to fix the root cause upstream: shrink the context a task needs and make the tools push toward correct solutions, rather than bolting on a second agent to verify the first. Every probabilistic step you add is another place for frustration to compound. Your mileage may vary; there are genuine cases for fan-out workflows, and when you hit them you'll want real harness separation between coordinator and worker agents. Come back here after you've gone full Gastown; it's fun out there but you can get lost in optimizing the machine not the results.

Lever 3: Tool performance

For tools, the whole game is context management: give the agent just enough to get started, then feed it the right correction in the most targeted dose possible. The major levers:

Better defaults.
Error messages that suggest the next step.
Progressive disclosure.
Making sure the information needed to progress exists somewhere.
Agent-friendly formatting.

A few things that were a genuine struggle - mostly involving letting agents tune their own tools:

Agents massively overindex on the case in front of them. Let one tune a tool and you get a mishmash of hyper-specialized guidance bolted on for one query it just saw.
They are poor diagnosticians. Agents can do reasonable trajectory analysis under close supervision, but accepting their default root-cause analysis without manually confirming is a recipe for sadness. They tend to speculate withotu confirming.
They are not objective oracles Claude, for instance, tends to assume a task failure means the model underneath is bad (it was quite sure DeepSeek was the problem, as an inferior model), rather than investigating - when on our relatively straightforward tasks, the model is rarely the actual culprit.

So why did we score so low?

With the levers in hand, the diagnosis came into focus. Our pit of success was shallow and hard to find.

No worn path to the pit. We gave agents no concrete examples of complex syntax, so they were unlikely to luck into it whenever it varied even slightly from SQL.
Walls that didn't slope inward. When an agent hit an error, the message didn't reliably guide it toward a fix - so it would misdiagnose the problem and spiral.
An untested API surface. We had an enormous surface area that had never been pressure-tested against the ways a query can be mis-written. It's hard to imagine every wrong path when you already know the language. Agents were tremendously good at finding the unexpected sharp corners - and genuine bugs.
Ironically, bad human modeling. With Trilogy we ship a semantic model to the agent and discourage it from poking at the raw database. So when the model had bad context, the agent made bad choices. Our auto-generated semantic model was initially better than our hand-curated one - humans had simply been lazy. Patching up the hand-curated model closed that gap and was a key to finally beating raw SQL.

And then there was the test itself

The last culprit was the benchmark. Our evals were built on the TPC-DS suite - the same one we use to verify the language generates correct, performant SQL. It's a fantastic tool for that: a hard schema you didn't design, reference outputs you can diff against, and hand-written SQL as a performance bar.

But TPC-DS is, in many ways, deeply artificial. Some queries have row duplication, weird joins, and cryptic intent (Query 99 makes "count of shipped orders" genuinely ambiguous - orders, or orders-times-items?). A human analyst shrugs and picks a reading. An agent spends a lot of tokens probing to understand why it's getting odd results - because it assumes the oddity is its own fault. The benchmark that was perfect for stress-testing the language isn't a perfect for the actual cases where we think we'd do better than raw SQL.

Any measure is a proxy for your end user experience; optimizing for the wrong measure will optimize for the wrong uers. We're not mad that we started with tpc-ds, but we need other evals to round out our delivery.

Tips

We're working on additional benchmarks that better represent a messy warehouse. Hex has some fantastic descriptions of how they needed to create a synthetic benchmark.

Back to better

Some concrete fixes, to ground this:

New Syntax

Trilogy isn't SQL. Agents, even with examples, tend to fall back to familiar paths. We have two methods to coerce them;

Better Errors

Better errors. When they do something that makes no sense - such as including a group by clause - strong pointed reminder.

Since we're trying to reduce token costs, iteration cycles kill as they contain the full context. [Yes, API providers might cache this, but still]. Ensuring that errors are returned together as a unit so multiple rounds aren't required to discover them is very useful. Broadly, we want all syntax errors to return together; then all semantic errors; then all execution errors. (hopefully the latter doesn't happen!).

Better Language

Add missing features. When an agent wants something that should exist, give it to the agent.

Relax constraints. When an agent reaches for not what you want but a reasonable alternative syntax, we can consider extending grammar to handle it. We're pretty torn on this one; but "friendly SQL" is winning out so as long as the language flexibility doesn't compromise precision.

A fun specific one there was namespace referencing for CTEs. Trilogy supports nested access - order.customer.first_name and namespacing a CTE, producing cte.order.customer.first_name. For non transformed columns, agents would default to cte.first_name instead of the full path. When told to do the full path, they'd say "that's too much work" and try a different path. So we allowed resolution of cte.first_name when it was unambigious.

This turned out to be a lot nicer to write, so thanks, agents.

End Results

We kept digging the whole deeper. It's fairly impressive how many edge cases agents can run into over a 99-query test suite. We've done hundres of rounds of optimizations:

Examples of complex syntax.
Error messages that suggested the next step, so the walls sloped inward. A pressure-tested API and a patched-up semantic model, so the context the agent stood on was solid. And evals read with an eye for which failures were Trilogy's fault and which were TPC-DS being TPC-DS.

The result: 25% to 80%, with lower token usage - now comfortably ahead of raw SQL instead of behind it. No single heroic fix; just measure, find the lever, pull it, measure again.

We're not at the mid-to-high nineties we ultimately want yet. But the pit is deeper and easier to find than it was, and every eval tells us exactly where to dig next.