Thesis
Thesis
Trilogy Aspires To
- simplify hard parts of SQL
- keep and enhance the good
- incentivize the best
- add modern reusability, type-checking, and ergonomics
It should be approachable to query by someone that knows only SQL, and easy to model for someone familiar with SQL or python.
To capture the iterative SQL development loop, Trilogy includes a lightweight definition language directly in the language, allowing you query and evolve the semantic layer in the same tool - and even in the same query session. This definition language (semantic layer, ERD, etc), providing static typing/enforcement, and handling joins [1-1, 1-many, many to 1], aggregations, filtering and nulls automatically.
Trilogy looks like a SQL select, without the FROM or GROUP BY clauses.
WHERE
SELECT
#FROM - moved to semantic layer
#GROUP BY - resolved automatically
HAVING
ORDER_BY
Tips
SQL is declarative; Trilogy queries should be even more declarative.
What it looks like
-- import a reusable model
import store_sales as store_sales;
-- where clause should come first, but can come later
WHERE
store_sales.date.year=2001
and store_sales.date.month_of_year=1
and store_sales.item.current_price > 1.2 * avg(store_sales.item.current_price) by store_sales.item.category
-- select looks normal
SELECT
store_sales.customer.state,
count(store_sales.customer.id) as customer_count
-- having filters the output of select
HAVING
customer_count>10
-- order by is normal
ORDER BY
customer_count asc nulls first,
store_sales.customer.state asc nulls first
LIMIT 100;
Vs SQL
SELECT
a.ca_state state,
count(*) cnt
FROM customer_address a ,
INNER JOIN customer c on a.ca_address_sk = c.c_current_addr_sk
INNER JOIN store_sales s on c.c_customer_sk = s.ss_customer_sk
INNER JOIN date_dim d on s.ss_sold_date_sk = d.d_date_sk
INNER JOIN item i on s.ss_item_sk = i.i_item_sk
WHERE
AND d.d_month_seq =
(SELECT DISTINCT (d_month_seq)
FROM date_dim
WHERE d_year = 2001
AND d_moy = 1 )
AND i.i_current_price > 1.2 *
(SELECT avg(j.i_current_price)
FROM item j
WHERE j.i_category = i.i_category)
GROUP BY
a.ca_state
HAVING count(*) >= 10
ORDER BY
cnt NULLS FIRST,
a.ca_state NULLS FIRST
LIMIT 100;
So, what's different?
Trilogy removed the need for the joins - and even the tables - via the imported model. The information about how to join tables can be encoded once and reused automatically.
Is this valuable?
Trilogy is not the first concept in this space; a non-exhaustive list of similar project is below (please open an issue on the repo if you know of more!) . There's a rich history of trying to 'fix' SQL.
These are all born out of the idea that SQL is incredibly powerful - it's as ubiquitous as it is for a reason - but that it can be better. Trilogy in particular aims to keep the core greatness of SQL while improving on the rough edges.
Trilogy asserts that most of the value of a SQL query is in the transformations and select; the joins and group by are where things go wrong. They add little value to the expressiveness of the language.
Moreoer, table definitions often lack key information for correct querying; the grain, primary and foreign keys, and the nullability of columns. Enforcement of these is spotty and engine dependent. Trilogy moves this definition to be a first-class part of the language. With this extra structure, we can automate the hard parts of a SQL query.
If successful, with Trilogy you'll put a bit more thought into the data modeling, and much less into the querying.
You'll write models are easily shared and extended, preserving the adhoc value of SQL.
Tips
Understanding a data model properly pays dividends over and over in query time; Trilogy makes this 'proper' understanding something that can be reused and shared.
"Better" SQL
"Better SQL" has been a popular space. All are worth looking at! Find what works for you.
Python Semantic Layers
The 'metrics and dimensions in a DSL that builds SQL' has also had significant innovation; a selection of projects is below. Trilogy differs from this category by focusing on keeping a language quite close to SQL.
Why don't we all just use SQL, again?
SQL is fantastic.
SQL has been the de-facto language for working with data for decades. Data professionals can use a common, declarative syntax to interact with anything from local file based databases to global distributed compute clusters.
But SQL solves the wrong problem for the modern data stack.
SQL is a declarative language for reading and manipulating data in tables in SQL databases. This is a perfect fit for an application interacting with a datastore.
But in data warehouses, a table is a leaky abstraction. Users don't care about tables; tables are a means to an end. They want the data, and the table is a detail. Tables will be replicated; aggregated; cached - and the user spends all their time on the container, not the product.
Tips
SQL is a language for getting data out of tables in a databaase. Trilogy is a language for getting/transforming data in a [warehouse/lake/mart], with all the evolution, deprecation, and change that implies. The tables will change, but your query doesn't need to.
Trilogy puts the answers first
An intuitive query for data should be oriented around the outputs, not where it happens to be.
Seeing revenue by product line is a goal; the table that contains the products and the table that contains revenue are implementation details.
If you want to know: Total revenue by product line and city in 2025, as a fraction of all city revenue
This is what we want you to have to write in Trilogy
WHERE
year = 2025
select
product_line,
sum(revenue) as product_revenue,
sum(revenue) by city_name as all_city_revenue,
product_revenue/all_city_revenue as product_line_revenue_ratio,
city_name
;
But this is what you have to write
WITH tmpCity as (
SELECT
city_name,
sum(revenue) total_city,
FROM fact_revenue_latest
INNER JOIN dim_city on fact_revenue_latest.city_id = dim_city.city_id
WHERE dim_date.year = 2025
GROUP BY
city_name
)
SELECT
product_line,
sum(revenue),
total_city,
sum(revenue) / total_city,
city_name
FROM fact_revenue_latest
INNER JOIN dim_city on fact_revenue_latest.city_id = dim_city.city_id
INNER JOIN dim_date on fact_revenue_latest.date_id = dim_date.date_id
INNER JOIN tmpCity on dim_city.city_name = tmp_city.city_name
WHERE dim_date.year = 2025
GROUP BY
product_line,
total_city,
city_name
Over time, growing data teams tend towards SQL sprawl - duplicative, hard to follow, brittle adhoc scripts and pipelines - which are critical to the company. Fortune 500 companies spend millions of dollars trying to reverse engineer the original intent of SQL to document dataflow or lineage, or to refactor business logic when moving to a new database.
How Trilogy Solves This
Trilogy separates declared conceptual manipulation (ex: [Profit] = [Revenue] - [Cost]) from the database that stores columns and runs queries. This semantic layer
isn't a new concept, but Trilogy puts at close as possible to the SQL itself, in a familiar form - you define the semantic layer with the same language you use to query it, meaning adhoc extension and iteration is easy.
These concepts and their derivation are strongly typed and can be statically analyzed and tested against a given set of datasources to prove the correctness for a given expression of business logic. These two definitions - the business logic and the access layer - can then be independently evolved and validated over time.
SQL
USE AdventureWorks;
SELECT
t.Name,
SUM(s.SubTotal) AS [Sub Total],
STR(Sum([TaxAmt])) AS [Total Taxes],
STR(Sum([TotalDue])) AS [Total Sales]
FROM Sales.SalesOrderHeader AS s
INNER JOIN Sales.SalesTerritory as t ON s.TerritoryID = t.TerritoryID
GROUP BY
t.Name
ORDER BY
t.Name
Trilogy
import concepts.sales as sales;
select
sales.territory_name,
sales.sub_total,
sales.total_taxes,
sales.total_sales,
order by
sales.territory_name desc;
How does it work?
The example above cheats a little - the statement import concepts.sales as sales;
is bringing in a model definition.
As a semantic layer, Trilogy requires some up-front binding to the database before the first query can be run. The cost to model the data is incurred infrequently, and then the savings are amortized over every single user and query.
Tips
Models can be defined, extended and bound to a table in-line; you don't need a separate file/definition to get started. Unlike other semantic layers, Trilogy supports - and encourages - adhoc extension and iteration.
Read me in the concepts and references section to learn how Trilogy works under the hood, and the nuances of query design and setup.
Usage
Trilogy is designed to be easy to learn and use, and to be able to be incrementally adopted. It can be run directly as a CLI, in a GUI, or compiled to SQL and run in standard SQL tooling.
Head over to the demo to see how this semantic layer is defined and run some example queries.