Trilogy NLP - Today
Goals
Trilogy-NLP aims to be a highly reliable platform for natural language interrogation of structured database contents. It aims to achieve this by mapping natural language queries to a well-defined but expressive middle layer (Trilogy!), which then executes the SQL to retrieve results. We want to be able to effectively generate queries rapidly without direct database access.
Principles:
- High precision
- No special syntax - the agents should work with what humans would
- Performant - magic is more magic when it's fast
- One-shot - iterative cycles shouldn't be a requirement to get a good answer
Why?
A lot of text-to-sql tools rely on the bot having fairly unrestrained access to the database. This can be problematic not just from a security and cost perspective, but because it introduces latency and coupling. We have a goal of trilogy-nlp supporting high quality SQL based off ONLY the textual metadata layer. This enables easy usage even when a database doesn't exist, is expensive, or isn't directly accessible.
So trilogy-nlp is really not "text-to-sql" - it's "text-to-trilogy". If we can get to trilogy, we should be good to hit the database without issue.
This is enabled by the following:
- Robust "validation" tooling to guide the chat agent
- Comments as documentation - field labels (especially formats) are used to guide the creation of proper filtering syntax, casts, and more
- Flexible IR - the chat agent targets a json spec that is a subset of full trilogy, which is moderately robust to various approaches.
How it Works
Standard agentic, tool-calling langchain workflow.
Two separate passes:
- Pass 1: identify which datasets to use
- Pass 2: build a query off the dataset
Database Identification
This is straightforward - listing the available trilogy files in a path, which one should be imported to build a query?
Query Building
The agent is responsible for calling tools on it's own. It's given a prompt with some minimal format examples to return.
It has access to:
- Date
- Wikipedia
- List fields
- Validate a JSON response
- Submit a JSON response
We terminate when there is a submission that passes validation.
At this point, we have JSON structured IR, which we will then compile to a set of concept declarations and query.
Example:
Prompt:
for male customers who are single with College education in the year 2000 and did not come from the event OR email promotion channels. (hint - have an 'OR' group checking for 'N' for both of those).
Use the demographic inputs, use the point in time value from customer sale.
Order by the item name asc
Would become this intermediation representation:
{
"output_columns": [
{
"name": "store_sales.item.name"
},
{
"name": "average_quantity_sold",
"calculation": {
"operator": "AVG",
"arguments": [
{
"name": "store_sales.quantity"
}
]
}
},
{
"name": "average_list_price",
"calculation": {
"operator": "AVG",
"arguments": [
{
"name": "store_sales.list_price"
}
]
}
},
{
"name": "average_coupon_amount",
"calculation": {
"operator": "AVG",
"arguments": [
{
"name": "store_sales.coupon_amt"
}
]
}
},
{
"name": "average_sales_price",
"calculation": {
"operator": "AVG",
"arguments": [
{
"name": "store_sales.sales_price"
}
]
}
}
],
"filtering": {
"root": {
"values": [
{
"operator": "=",
"left": {
"name": "store_sales.customer.demographics.gender"
},
"right": {
"value": "M",
"type": "string"
}
},
{
"operator": "=",
"left": {
"name": "store_sales.customer.demographics.marital_status"
},
"right": {
"value": "S",
"type": "string"
}
},
{
"operator": "=",
"left": {
"name": "store_sales.customer.demographics.education_status"
},
"right": {
"value": "College",
"type": "string"
}
},
{
"operator": "=",
"left": {
"name": "store_sales.date.date.year"
},
"right": {
"value": "2000",
"type": "integer"
}
},
{
"boolean": "or",
"values": [
{
"operator": "=",
"left": {
"name": "store_sales.promotion.channel_event"
},
"right": {
"value": "N",
"type": "string"
}
},
{
"operator": "=",
"left": {
"name": "store_sales.promotion.channel_email"
},
"right": {
"value": "N",
"type": "string"
}
}
]
}
],
"boolean": "and"
}
},
"order": [
{
"column_name": "store_sales.item.name",
"order": "asc"
}
],
"limit": -1
}
Would output this trilogy:
import store_sales as store_sales;
WHERE
store_sales.customer_demographic.gender = 'M'
and store_sales.customer_demographic.marital_status = 'S'
and store_sales.customer_demographic.education_status = 'College'
and (store_sales.promotion.channel_email = 'N' or store_sales.promotion.channel_event = 'N')
and store_sales.date.year = 2000
SELECT
store_sales.item.name,
avg(store_sales.quantity) as avg_quantity,
avg(store_sales.list_price) as avg_list_price,
avg(store_sales.coupon_amt) as avg_coupon_amt,
avg(store_sales.sales_price) as avg_sales_price
order by
store_sales.item.name asc
limit 100
;