Trilogy NLP - History

Before Trilogy

In the mid-2010s, we supported ad-hoc access to data at an e-commerce company through a visual UI. This interface let users select a base source, pick a set of columns with basic transforms, and add various filters. The selection was broken down into dimensions and metrics, where dimensions became group-by columns and metrics were calculations. The tool hooked into a range of backends: Vertica databases, SQL Server, SSAS cubes, and more.

Early on, we recognized that this semantic mapping could support a natural language interface - so we hacked together a Slack bot that used NLTK to tokenize statements and tag their roles, fuzzily matched tokens against the concept/metric lists, and generated schedules.

This worked well-ish, but with a lot of requirements on the metadata layer. We had to maintain lists of synonyms, potential filter values, and matching priorities. If you knew how to write a query, you could get a reliable answer - but "knowing how to ask the question" meant that the audience was only slightly broader than those who could use the UI tool.

Ultimately, the setup overhead plus the limitations of Slack as an interface meant that we pivoted away from this approach, and the ad-hoc data tool was eventually sunset as we moved to a more mature modern data stack based on Looker.

Trilogy and GPT-3

Initial Trilogy maturity lined up with the first wave of GPT hype. GPT-3 could reasonably interpret arbitrary human inputs, but frequently lost context. Giving it a question and a full data model had a relatively low success rate.

Our approach here was to constrain the problem space as much as possible. We would first split all our concepts into tokens and feed those to the chatbot to identify which tokens were relevant to the question; this reduced the search space for a second pass that did the full selection, followed by additional passes to construct filtering. At each step, the prompts were designed to map to a very constrained output.

This achieved reasonable reliability, but at the cost of expressiveness. A prompt that got you the answer you wanted looked a lot like what you would have written in SQL.

Worse, the agents had no ability to create new metrics - if you wanted them to use calculations, those had to be predefined for selection.

GPT-3.5 and Equivalents

GPT-3.5 opened up a new frontier - reasonable agent-based workflows. We were able to loosen the original constrained design and add fun new tools, like using Wikipedia to retrieve additional context. These approaches generally matched the reliability of the original constrained prompts, let us remove a lot of custom code and consolidate on langchain and standard tooling, and added support for basic calculations.

GPT-4.0 and Equivalents

GPT-4.0 was the first version where we could reliably get correct outputs for a complex prompt. In fact, debugging became more of a 50/50 split between "did the agent get the right output for the prompt?" and "did I actually get the prompt right?" It turns out SQL is startlingly concise; capturing all the nuance of a given query in prose requires significant verbosity.

For example, the prompt below is wordy just to pin down > rather than >=, and the calculation is repeated twice for emphasis.

"""Using just store sales information, list all the states and the customer count filtered to those who bought items with a current item price that is more than 1.2 (greater than, not greater than or equal) times higher than the current average item price of all other items in the same category in january of 2001.

(tip: for the sales price filter, get the average item current price by category, then compare that times 1.2 against the item price to ensure the item price is higher).

Restrict to sales in January of 2001 where the category is not null and where customer count >=10.

Order results by the customer count asc (nulls first), then the state ascending (nulls first)"""
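
For comparison, a rough hand-written SQL equivalent of that prompt fits in a couple dozen lines. This is only a sketch against an assumed TPC-DS-style schema (store_sales, item, date_dim, customer, customer_address); the table and column names here are illustrative placeholders, not our actual model, and the NULLS FIRST syntax varies by dialect.

    -- Sketch: approximate SQL for the prompt above, against an assumed
    -- TPC-DS-style schema. Names are illustrative placeholders.
    SELECT
        a.ca_state                      AS state,
        COUNT(DISTINCT c.c_customer_sk) AS customer_count
    FROM store_sales s
    JOIN customer c         ON s.ss_customer_sk = c.c_customer_sk
    JOIN customer_address a ON c.c_current_addr_sk = a.ca_address_sk
    JOIN item i             ON s.ss_item_sk = i.i_item_sk
    JOIN date_dim d         ON s.ss_sold_date_sk = d.d_date_sk
    WHERE d.d_year = 2001
      AND d.d_moy = 1                   -- sales in January 2001
      AND i.i_category IS NOT NULL
      -- strictly greater than 1.2x the category's average current price
      AND i.i_current_price > 1.2 * (
            SELECT AVG(j.i_current_price)
            FROM item j
            WHERE j.i_category = i.i_category
          )
    GROUP BY a.ca_state
    HAVING COUNT(DISTINCT c.c_customer_sk) >= 10
    ORDER BY customer_count ASC NULLS FIRST,
             state ASC NULLS FIRST;

Nearly every clause in the prose prompt - the strict inequality, the null handling, the ordering - maps to a single token or keyword here, which is exactly the nuance the natural language version has to spell out at length.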

GPT-5.0

Here, we'd expect to be able to loosen up some of the complex prompts and rely more on validation.