Demo - Exploring the Titanic
This Titanic demo is a simple example of how Trilogy syntax works. It uses the common Titanic dataset, a single table whose fields describe individual passengers.
Our First Queries
TIP
This demo uses an in-memory DuckDB database that may take some time to start for your first query. Trilogy is database agnostic; the same syntax will work on any backend, such as Postgres, BigQuery, or Snowflake.
Basic Select
select passenger.name, passenger.id limit 5;
How many survived?
select
passenger.survived,
passenger.id.count,
count(passenger.id) as passenger_count_alt,
count(passenger.id) -> passenger_count_alt_2
;
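Under the hood, a query like this compiles to ordinary aggregate SQL. As a rough illustration of the shape (not the actual generated SQL), here is the equivalent using Python's stdlib sqlite3 with a few invented rows:

```python
import sqlite3

# Toy stand-in for the raw Titanic table; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_titanic (passengerid INTEGER, survived INTEGER)")
conn.executemany(
    "INSERT INTO raw_titanic VALUES (?, ?)",
    [(1, 1), (2, 0), (3, 1), (4, 0), (5, 0)],
)

# The Trilogy select above maps to a plain GROUP BY over the mapped columns;
# all three count expressions are the same aggregate under different aliases.
rows = conn.execute(
    "SELECT survived, COUNT(passengerid) AS passenger_count "
    "FROM raw_titanic GROUP BY survived ORDER BY survived"
).fetchall()
print(rows)  # [(0, 3), (1, 2)]
```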
The Model
Those queries could run because we have a model defined already; we'll dig into this more later. For now, just note that each concept is declared with a type, we have one derived metric, and there is a datasource defined that maps these concepts to a table.
TIP
A model describes the relationships between Trilogy concepts and the underlying data. It is a contract that can be used to generate queries.
key passenger.id int; # numeric identifier created for dataset
property passenger.id.age int?; # can be null, age of passenger
property passenger.id.survived int; # 1 = survived, 0 = died
property passenger.id.name string; # full name
property passenger.id.class int; # class of passenger, 1-3
property passenger.id.fare float; # the price paid by passenger
property passenger.id.cabin string?; # can be null, the cabin the passenger was in
property passenger.id.embarked bool;
property passenger.id.last_name <- split(passenger.name,',')[1];
metric passenger.id.count <- count(passenger.id);
datasource raw_data (
passengerid:passenger.id,
age:passenger.age,
survived:passenger.survived,
pclass:passenger.class,
name:passenger.name,
fare:passenger.fare,
cabin:passenger.cabin,
embarked:passenger.embarked,
)
grain (passenger.id)
address raw_titanic;
Derived Concepts and Filtering
TIP
Last_name is defined in the model as splitting a passenger name by a comma and taking the second element. We'll use this derived 'last_name' as a rough family identifier in these queries; if you explore in the sandbox you'll find that's not quite right, but it's good enough for now.
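In plain Python terms, the derivation is roughly the following (a sketch; the example name and the assumption that the split index resolves to the portion before the comma are ours):

```python
# Hypothetical passenger name in the dataset's "Last, Title First" format.
name = "Braund, Mr. Owen Harris"

# Rough family identifier: the part before the comma.
# Trilogy's split(passenger.name, ',')[1] appears to use 1-based indexing;
# the equivalent Python index is 0.
last_name = name.split(",")[0]
print(last_name)  # Braund
```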
Let's run through some examples.
Family Sizing (By Last Name)
select
passenger.last_name,
count(passenger.id) as family_size
order by
family_size desc
limit 5;
How big were families in each class?
select
passenger.class,
max(
count(passenger.id) by passenger.last_name
) by passenger.class
as max_family_size_per_class,
avg(
count(passenger.id) by passenger.last_name
) by passenger.class
-> avg_family_size_per_class,
order by
passenger.class asc
;
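The nested `count(...) by passenger.last_name` aggregate corresponds to aggregating over a subquery in SQL. A hedged sketch with stdlib sqlite3 and invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_titanic (id INTEGER, last_name TEXT, class INTEGER)")
conn.executemany("INSERT INTO raw_titanic VALUES (?, ?, ?)", [
    (1, "Smith", 1), (2, "Smith", 1), (3, "Jones", 1),  # class 1: families of 2 and 1
    (4, "Brown", 3), (5, "Brown", 3), (6, "Brown", 3),  # class 3: one family of 3
])

# Inner grouping: family size per (class, last_name);
# outer grouping: max/avg of those sizes per class.
rows = conn.execute("""
    SELECT class,
           MAX(family_size) AS max_family_size_per_class,
           AVG(family_size) AS avg_family_size_per_class
    FROM (SELECT class, last_name, COUNT(id) AS family_size
          FROM raw_titanic
          GROUP BY class, last_name)
    GROUP BY class
    ORDER BY class
""").fetchall()
print(rows)  # [(1, 2, 1.5), (3, 3, 3.0)]
```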
How many survived from each family (by last name)?
auto surviving_passenger <-
filter passenger.id where passenger.survived = 1;
select
passenger.last_name,
passenger.id.count,
count(surviving_passenger) as surviving_size
order by
passenger.id.count desc
limit 5;
Families (by last name) where everyone survived
auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
select
passenger.last_name,
passenger.id.count,
count(surviving_passenger) as surviving_size
having
passenger.id.count = surviving_size
order by
passenger.id.count desc
limit 5;
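The filtered-concept-plus-having pattern above maps onto a GROUP BY ... HAVING in SQL. A minimal sqlite3 sketch with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_titanic (id INTEGER, last_name TEXT, survived INTEGER)")
conn.executemany("INSERT INTO raw_titanic VALUES (?, ?, ?)", [
    (1, "Smith", 1), (2, "Smith", 1),  # everyone survived
    (3, "Jones", 1), (4, "Jones", 0),  # one of two survived
    (5, "Brown", 0),                   # no survivors
])

# SUM over a 0/1 flag plays the role of count(surviving_passenger):
# it counts the survivors in each group.
rows = conn.execute("""
    SELECT last_name,
           COUNT(id) AS family_size,
           SUM(survived) AS surviving_size
    FROM raw_titanic
    GROUP BY last_name
    HAVING COUNT(id) = SUM(survived)
""").fetchall()
print(rows)  # [('Smith', 2, 2)]
```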
Get the first name, last name, and age of the oldest survivor in each family
where
passenger.survived = 1 and passenger.age is not null
select
--passenger.id,
passenger.name,
passenger.last_name,
passenger.age,
--row_number passenger.id over
passenger.last_name
order by passenger.age desc
as eldest
having
eldest = 1
order by passenger.last_name desc
limit 5;
Find names of the 1st and 3rd oldest survivor in each family
with survivors_and_rank as
where
passenger.survived = 1 and passenger.age is not null
select
passenger.id,
passenger.name,
passenger.last_name,
passenger.age,
row_number passenger.id over
passenger.last_name
order by passenger.age desc
as age_rank;
where
survivors_and_rank.age_rank in (1,3)
select
survivors_and_rank.passenger.name,
survivors_and_rank.passenger.last_name,
survivors_and_rank.age_rank,
max(survivors_and_rank.age_rank)
by survivors_and_rank.passenger.last_name
-> number_of_survivors
having
number_of_survivors = 3
order by
survivors_and_rank.passenger.last_name desc
limit 10
;
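The `row_number ... over ... order by ... as` construct corresponds to a SQL window function. A sketch of the same idea with stdlib sqlite3 (window functions require SQLite 3.25+, which ships with modern Python) and invented survivors:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survivors (name TEXT, last_name TEXT, age REAL)")
conn.executemany("INSERT INTO survivors VALUES (?, ?, ?)", [
    ("Smith, Mrs. A", "Smith", 60.0),
    ("Smith, Mr. B", "Smith", 30.0),
    ("Jones, Miss C", "Jones", 22.0),
])

# Rank within each family by descending age, then keep only the eldest.
rows = conn.execute("""
    SELECT name, last_name, age
    FROM (SELECT name, last_name, age,
                 ROW_NUMBER() OVER (PARTITION BY last_name
                                    ORDER BY age DESC) AS eldest
          FROM survivors)
    WHERE eldest = 1
    ORDER BY last_name DESC
""").fetchall()
print(rows)  # [('Smith, Mrs. A', 'Smith', 60.0), ('Jones, Miss C', 'Jones', 22.0)]
```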
Sandbox
Now, try writing your own queries in the sandbox below.
The query box is stateful; a successful query (such as adding a concept) will be saved and can be referenced in future queries.
TIP
For basic queries, Trilogy should be almost identical to SQL. When in doubt, try the SQL syntax!
Can you answer these questions? (click to show a possible answer)
- Did different classes have different average fares?
- Were people in higher classes more likely to survive?
- Were certain ages more likely to survive?
- What was the average family (assume one last name is one family) survival rate in each class?
The following concepts are predefined for you and can be referenced by name.
Available Concepts
Multiple Tables
You've been able to do some initial analysis on the Titanic dataset, but now your data engineer has gotten excited about someone named Kimball. They've refactored your dataset to normalize it, and now you have the following tables:
- fact_titanic
- dim_class
- dim_passenger
Let's see how we can use Trilogy to query this new dataset.
Basic Select
select passenger.name, passenger.id limit 5;
How many survived?
select
passenger.survived,
passenger.id.count,
count(passenger.id) as passenger_count_alt,
count(passenger.id) -> passenger_count_alt_2
;
Family Sizing (By Last Name)
select
passenger.last_name,
count(passenger.id) as family_size
order by
family_size desc
limit 5;
How big were families in each class?
select
passenger.class,
max(
count(passenger.id) by passenger.last_name
) by passenger.class
as max_family_size_per_class,
avg(
count(passenger.id) by passenger.last_name
) by passenger.class
-> avg_family_size_per_class,
order by
passenger.class asc
;
How many survived from each family (by last name)?
auto surviving_passenger <-
filter passenger.id where passenger.survived = 1;
select
passenger.last_name,
passenger.id.count,
count(surviving_passenger) as surviving_size
order by
passenger.id.count desc
limit 5;
Families (by last name) where everyone survived
auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
select
passenger.last_name,
passenger.id.count,
count(surviving_passenger) as surviving_size
having
passenger.id.count = surviving_size
order by
passenger.id.count desc
limit 5;
Get the first name, last name, and age of the oldest survivor in each family
where
passenger.survived = 1 and passenger.age is not null
select
--passenger.id,
passenger.name,
passenger.last_name,
passenger.age,
--row_number passenger.id over
passenger.last_name
order by passenger.age desc
as eldest
having
eldest = 1
order by passenger.last_name desc
limit 5;
Find names of the 1st and 3rd oldest survivor in each family
with survivors_and_rank as
where
passenger.survived = 1 and passenger.age is not null
select
passenger.id,
passenger.name,
passenger.last_name,
passenger.age,
row_number passenger.id over
passenger.last_name
order by passenger.age desc
as age_rank;
where
survivors_and_rank.age_rank in (1,3)
select
survivors_and_rank.passenger.name,
survivors_and_rank.passenger.last_name,
survivors_and_rank.age_rank,
max(survivors_and_rank.age_rank)
by survivors_and_rank.passenger.last_name
-> number_of_survivors
having
number_of_survivors = 3
order by
survivors_and_rank.passenger.last_name desc
limit 10
;
This should look pretty familiar. What's going on? Take a look at the queries being generated by clicking the 'sql' tab to confirm that something has changed.
The Model
Let's look at our new model.
key passenger.id int;
key _class_id int;
property passenger.id.age int?;
property passenger.id.survived bool;
property passenger.id.name string;
property passenger.id.fare float;
property passenger.id.cabin string?;
property passenger.id.embarked int;
property passenger.last_name <- index_access(split(passenger.name,','),1);
datasource dim_passenger (
id:passenger.id,
age:passenger.age,
name:passenger.name,
last_name:passenger.last_name
)
grain (passenger.id)
address dim_passenger;
datasource fact_titanic (
passengerid:passenger.id,
survived:passenger.survived,
class_id:_class_id,
fare:passenger.fare,
cabin:passenger.cabin,
embarked:passenger.embarked
)
grain (passenger.id)
address fact_titanic;
datasource dim_class (
id:_class_id,
class:passenger.class
)
grain (_class_id)
address dim_class;
Note that our concepts are almost unchanged. We've added a new _class_id concept to capture the new surrogate key added to dim_class.
We've also changed the datasource definitions to point to the new tables.
But as a consumer, you haven't needed to change a thing. We are still encoding the same logical concepts, and the semantic relationships between them have not changed.
This is the convenience of separating the query language from the data model. The data model expresses a contract that can be evolved independently of the underlying materialized database tables, enabling transparent refactoring, aggregation, and remodeling to reflect the changing needs of the business.
Multi-Table Sandbox
Try querying the new model in the sandbox below.
Extending the Model
A typical Trilogy script will be based on importing existing models.
import passengers as passenger;
Often, those models may themselves contain imports, which together can create a complex model.
A passenger model may import a family model, for example, enabling nested references like the one below.
import passengers as passenger;
select passenger.family.last_name, passenger.family.size;
For our titanic model, imagine if someone else had set up a reference model of the richest families in the world.
We want to merge this family information in with our titanic model.
There are two approaches we can use: one is to merge the results of two models within a single query; the other is to merge the two models on a shared concept so that we can build many dynamic queries against them.
Query Merges
Query merges look like selects, but with more than one select clause and an align statement that defines how to merge the selects. Think of this as a full outer join between the listed columns to merge the select outputs. The where, order by, and other clauses are identical.
This can be useful for combining disparate concepts into a single result row, such as showing orders by ship date and delivery date in the same row.
SELECT 1
MERGE
SELECT 2
ALIGN <align_alias>:<select 1 concept>,<select 2 concept>
AND
<align_alias_2>:<select 1 concept>,<select 2 concept>
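The align semantics can be sketched in plain Python: each select becomes a keyed result set, and the aligned output is a full outer join on the align key (toy values invented for illustration):

```python
# Output of "select 1": family name -> passenger count (invented values).
family_count = {"Astor": 2, "Braund": 5}
# Output of "select 2": family name -> 1918 net worth (invented values).
net_worth = {"Astor": 87_000_000, "Guggenheim": 50_000_000}

# ALIGN join_last_name: full outer join on the shared key, with None
# (SQL NULL) where one side has no matching row.
all_names = sorted(set(family_count) | set(net_worth))
merged = [(n, family_count.get(n), net_worth.get(n)) for n in all_names]
print(merged)
# [('Astor', 2, 87000000), ('Braund', 5, None), ('Guggenheim', None, 50000000)]
```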
Model Merges
Model merges are syntactically more straightforward. Conceptually, they represent that the two concepts are synonymous but may not have all the same values; they can be used to bridge concept references across two models.
MERGE <concept1> into <modifiers?><concept2>
Let's try both out:
Query Merge
# this is already imported, commented for clarity
# import rich_families as rich_info;
SELECT
passenger.last_name,
count(passenger.id) -> family_count
MERGE
SELECT
rich_info.last_name,
rich_info.net_worth_1918_dollars
ALIGN
join_last_name:passenger.last_name, rich_info.last_name
WHERE
rich_info.net_worth_1918_dollars is not null
AND passenger.last_name is not null
ORDER BY
family_count desc;
Model Merge
# this is already imported, commented for clarity
# import rich_families as rich_info;
MERGE rich_info.last_name into ~passenger.last_name;
SELECT
passenger.last_name,
count(passenger.id)-> family_count,
rich_info.net_worth_1918_dollars,
WHERE
rich_info.net_worth_1918_dollars is not null
and passenger.last_name is not null
ORDER BY
family_count desc ;
TIP
Multiple merge statements can be defined between two models; queries will merge across as many concepts as are referenced in the query.
Saving Results / ETL
Imagine you want to create tables or save the outputs of a query to power a dashboard.
Trilogy supports this through the persist keyword.
This keyword signifies that any table created by Trilogy is effectively a cache of a given output. You've already defined canonical sources; if we copy that data into a new table, it's only valid until the sources change.
In base Trilogy, this can be used to create or update a table, such as one powering a dashboard.
TIP
In practice, most warehouses will have a finite set of 'roots' and a number of caches derived from them that are refreshed on some cadence to drive reporting, analytics, and performance. Trilogy enables you to explicitly define and manage those caches via persist statements.
The first query here shows a persist command; the second shows how the generated query will reference the persisted value.
Basic Persist
property passenger.id.split_cabin <- unnest(split(passenger.cabin, ' '));
persist cabin_info into dim_cabins from
select
passenger.id,
passenger.split_cabin;
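In backend terms, a persist behaves like a CREATE TABLE AS over the compiled select: the new table caches the output, and later queries over those concepts can read the cache. A simplified sqlite3 sketch (the unnest/split step is omitted for brevity; rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_titanic (id INTEGER, cabin TEXT)")
conn.executemany("INSERT INTO raw_titanic VALUES (?, ?)",
                 [(1, "C85"), (2, "E46")])

# persist cabin_info into dim_cabins from select ...  ~  CREATE TABLE AS
conn.execute("CREATE TABLE dim_cabins AS SELECT id, cabin FROM raw_titanic")

# Subsequent queries on these concepts can be routed to the cached table,
# which stays valid only until the canonical sources change.
rows = conn.execute("SELECT id, cabin FROM dim_cabins ORDER BY id").fetchall()
print(rows)  # [(1, 'C85'), (2, 'E46')]
```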
Query Our Persisted Table
select passenger.id, passenger.split_cabin limit 20;
TrilogyT
TrilogyT is a Trilogy integration that adds additional keywords to support ETL workflows; it integrates closely with dbt to support a full data warehouse workflow.
You can try TrilogyT in the Trilogyt demo or read more about it in the docs.
Benefits of A Semantic Layer
A semantic layer is a business representation of data and offers a unified and consolidated view of data across an organization.
Trilogy is a semantic layer - the abstract expression of relationships between concepts enables a consistent and coherent view of the data, and the separation of the model from the underlying data enables the model to evolve independently of the underlying data.
This makes Trilogy a natural fit for any case where you want to operate at this higher level of abstraction - a popular one these days being GenAI. It's much easier for a GenAI model to identify the appropriate concepts for a query and let Trilogy map those to the underlying data model; and as with a human-written query, this makes the result robust to changes and evolution of your data model.
Trilogy-NLP / GenAI
Trilogy-NLP is a Trilogy integration that adds natural language processing to Trilogy, enabling users to write queries in natural language and have them translated to SQL via the Trilogy semantic model. The simplified abstraction of a Trilogy model is a natural fit for generative AI models, sidestepping potential hallucination issues and SQL correctness challenges by focusing the model solely on interpreting the user query and mapping it to the higher-level semantic layer.
WARNING
Trilogy-NLP is in active development and may not always return correct results. As with other generative AI tools, experimenting with what prompts produce the best results is recommended.
Trilogy-NLP can either be used in an interactive analytics context, as a helper for users to write queries, or as a backend for a chatbot or other natural language interface to a database.
You can experiment with querying the titanic dataset we've defined using Trilogy-NLP below.
TIP
You can view the generated query just as with standard Trilogy - Trilogy-NLP is just a different way to write the same queries, and it will always show its work.
TPC-DS Comparisons
TPC-DS is a common database benchmarking tool that models a warehouse retail database. Ensuring that we can accurately replicate the queries against the TPC-DS schema is a good test of the expressive power of Trilogy. You can read more about test cases against TPC-DS in our blog post about TPC-DS and Trilogy.
These queries provide a view of how Trilogy looks across a larger data model spanning many tables, with some light commentary.
WARNING
This section is a work in progress and will be expanded over time.
Query 3 - Brand Sales
This is a fairly straightforward query in both languages.
SQL
SELECT dt.d_year,
item.i_brand_id brand_id,
item.i_brand brand,
sum(ss_ext_sales_price) sum_agg
FROM date_dim dt,
store_sales,
item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 128
AND dt.d_moy=11
GROUP BY dt.d_year,
item.i_brand,
item.i_brand_id
ORDER BY dt.d_year,
sum_agg DESC,
brand_id
LIMIT 100;
Trilogy
Trilogy looks very similar, without the group by.
import store_sales as store_sales;
WHERE
store_sales.date.month_of_year=11 and store_sales.item.manufacturer_id=128
SELECT
store_sales.date.year,
store_sales.item.brand_id,
store_sales.item.brand_name,
sum(store_sales.ext_sales_price)->sum_agg
ORDER BY
store_sales.date.year asc,
sum_agg desc,
store_sales.item.brand_id asc
limit 100;
Query 1 - Customer Returns
This is a more complicated query.
SQL
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk,
sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns,
date_dim
WHERE sr_returned_date_sk = d_date_sk
AND d_year = 2000
GROUP BY sr_customer_sk,
sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1,
store,
customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;
Trilogy
Here, we'll use an inline calculation of total returns by customer/store, and then reuse it immediately to calculate the average customer returns by store.
Since these are newly derived aggregates, we'll filter them in the having clause.
import store_returns as returns;
WHERE
returns.store.state = 'TN' and
returns.return_date.year = 2000
SELECT
returns.customer.text_id,
--sum(returns.return_amount)-> total_returns, # returns by customer/store
--returns.store.id, # this query implicitly groups by text_id and store, but only returns text_id
--avg(total_returns) by returns.store.id -> avg_store_returns # average of the previous calc
HAVING
total_returns > (1.2*avg_store_returns)
ORDER BY
returns.customer.text_id asc
limit 100;