Demo - Exploring the TPC-DS Dataset
This demo uses the popular TPD-DS dataset, typically used for benchmarking capability/performance for transaction workloads.
No benchmarking right now, though - we just care that DuckDB can natively produce a nice representative data warehouse for TPC-DS with a built in extension, and that this dataset gives us a good way to explore the language syntax. You can read more about Trilogy and our use of this benchmark tool here here.
TIP
This demo uses an in-memory DuckDB database that may take some time to start for your first query. Trilogy is database agnostic; the same syntax will work on any backend such as Postgres, Bigquery, or Snowflake.
Our First Queries
TPC-DS creates a nice small warehouse; 17 dimension tables and 7 fact tables.
Dimensions
Fact Tables
That's a lot to work with! Fortunately, someone has already defined a Trilogy model for TPC-DS here; we'll be using that directly. We'll discuss models in more detail in a little bit.
To start out, let's focus on these Trilogy imports. You'll notice that these map closely to fact tables; that's not a coincidence. It's typical for a user query to be centered on one or more facts, so those are natural places to start a modeling hierarchy. Trilogy imports let you expose many entrypoints into the same relationship of tables to support intuitive query patterns.
However, we're going to hide the import statements from all the queries we run here. We'll work in an environment with all of them loaded already!
TIP
In the below section, you have the opportunity to run real Trilogy queries. Click the pulsing 'run' button below each query to see results and SQL.
Basic Select
select
customer.id,
customer.full_name,
limit 5;
Another View, Similar Data
select
store_sales.customer.id,
store_sales.customer.full_name,
store_sales.ticket_number,
limit 5;
Sales By Customer
select
store_sales.customer.id,
store_sales.customer.first_name,
store_sales.customer.state,
sum(store_sales.sales_price) as total_sales
order by
total_sales desc
limit 10;
The Model
We mentioned earlier that we had a model defined that was lettings us run these queries. Let's peek below the hood. Modeling is the heart and "art" of Trilogy; defining names and relationships that enable intuitive access to data.
Models will define some concepts, bind some concepts, and often - import other models. While models can be 1-1 to tables and often are, they don't need to be - and this can be useful when refactoring, extending, or simplifying models.
TIP
A model describes the relationships between Trilogy concepts and the underlying data. It is a contract that can be used to generate queries.
TIP
A model definition is written in the same language as these queries - the access language and the modeling language are identical in Trilogy. You can define and extend a model as part of an adhoc exploratory workload, just as you can with SQL.
This definition is for the store_sales
references we've been querying above. The full tpc-ds model contains multiple files like this.
import item as item;
import date as date;
import date as return_date;
import time as time;
import time as return_time;
import customer as customer;
import customer as return_customer;
import promotion as promotion;
import customer_demographic as customer_demographic;
import store as store;
import store as return_store;
# you can interleave modeling and queries;
select 1 -> example;
key ticket_number int;
property <ticket_number,item.id>.quantity int;
property <ticket_number,item.id>.sales_price float;
property <ticket_number,item.id>.list_price float;
property <ticket_number,item.id>.ext_sales_price float;
property <ticket_number,item.id>.ext_wholesale_cost float;
property <ticket_number,item.id>.ext_list_price float;
property <ticket_number,item.id>.ext_discount_amount float;
property <ticket_number,item.id>.coupon_amt float;
property <ticket_number,item.id>.net_profit float;
property <ticket_number,item.id>.is_returned bool;
property <ticket_number,item.id>.net_paid float;
property <ticket_number,item.id>.return_amount float?;
property <ticket_number,item.id>.return_net_loss float?;
# and derive calculated metrics
auto profit <- ext_list_price - ext_wholesale_cost - ext_discount_amount + ext_sales_price;
# bind defined values to a table
datasource store_sales (
SS_SOLD_DATE_SK: date.id,
SS_SOLD_TIME_SK: time.id,
SS_CUSTOMER_SK: customer.id,
SS_CDEMO_SK: customer_demographic.id,
SS_TICKET_NUMBER: ticket_number,
SS_ITEM_SK: item.id,
SS_SALES_PRICE: sales_price,
SS_LIST_PRICE: list_price,
SS_EXT_SALES_PRICE: ext_sales_price,
SS_EXT_LIST_PRICE: ext_list_price,
SS_EXT_WHOLESALE_COST: ext_wholesale_cost,
SS_EXT_DISCOUNT_AMT: ext_discount_amount,
SS_NET_PROFIT: net_profit,
SS_PROMO_SK: promotion.id,
SS_QUANTITY: quantity,
SS_COUPON_AMT: coupon_amt,
SS_STORE_SK: store.id,
SS_NET_PAID: net_paid,
)
# the grain is used to resolve queries appropriately
grain (ticket_number, item.id)
address memory.store_sales
;
# multiple tables can be used to source metrics
datasource store_returns(
SR_RETURNED_DATE_SK: return_date.id,
SR_RETURN_TIME_SK: return_time.id,
SR_ITEM_SK: ~item.id,
SR_CUSTOMER_SK: return_customer.id,
SR_RETURN_AMT: return_amount,
SR_TICKET_NUMBER: ~ticket_number,
SR_STORE_SK: return_store.id,
SR_NET_LOSS: return_net_loss,
# capture if there is a row in this table
bool(return_time.id): is_returned,
)
grain (ticket_number, item.id)
address memory.store_returns;
If you look at the top, there are imports. These are the key to reuse in Trilogy; they enable composition of models.
Returns by Different State
where store_sales.return_customer.state is not null
and store_sales.return_customer.state != store_sales.customer.state
select
store_sales.customer.state,
sum(store_sales.sales_price) as total_sales,
store_sales.return_customer.state,
order by
total_sales desc
limit 10;
In fact, if we drill into customers, we'll see that model itself has an import. That entire extended model has been imported twice; once for the customer who bought, and once for the one that returned, enabling the same logic to be reused across both keys easily.
Beyond reuse, we want abstraction - the key to that is that datasources - which make the models concrete - can be evolved and split transparently to a consuming query. We can even use this to dynamically swap in aggregates when we compute a higher level table with the same outputs. (such as if a session runs a persist to store a query output).
import customer_demographic as demographics;
key id int;
property id.text_id string;
property id.last_name string;
property id.first_name string;
property id.preferred_cust_flag string;
property id.birth_day int;
property id.birth_month int;
property id.birth_year int;
property id.birth_country string;
property id.salutation string;
property id.email_address string;
property id.login string;
property id.last_review_date string;
# avoid having to recalculate this in queries
property id.full_name <- concat(salutation, ' ', first_name, ' ', last_name);
property id.birth_date <- cast(
concat(cast(birth_year as string),
'/', cast(birth_month as string),
'/', cast(birth_day as string)
) as date
);
key address_id int;
property address_id.address_id_string string;
property address_id.street string;
property address_id.city string;
property address_id.state string?;
property address_id.zip string;
property address_id.county string;
property address_id.country string;
datasource customers (
C_CUSTOMER_SK: id,
C_CUSTOMER_ID: text_id,
C_LAST_NAME: last_name,
C_FIRST_NAME: first_name,
C_CURRENT_ADDR_SK: address_id,
C_CURRENT_CDEMO_SK: demographics.id,
C_PREFERRED_CUST_FLAG: preferred_cust_flag,
C_BIRTH_COUNTRY: birth_country,
C_SALUTATION: salutation,
C_EMAIL_ADDRESS: email_address,
C_BIRTH_DAY: birth_day,
C_BIRTH_MONTH: birth_month,
C_BIRTH_YEAR: birth_year,
C_LOGIN: login,
C_LAST_REVIEW_DATE_SK:last_review_date,
)
grain (id)
address memory.customer;
datasource customer_address(
CA_ADDRESS_SK: address_id,
CA_STREET_NAME: street,
CA_CITY: city,
CA_STATE: state, # Two character state code; eg. CA for California, MA for Massachusetts
CA_ZIP: zip,
CA_COUNTY: county,
CA_COUNTRY:country,
)
grain (address_id)
address memory.customer_address;
Derived Concepts and Filtering
Now that we understand the basics of selection, let's get into the world of derivation, multi-level aggregation, and filtering.
We're going to stick with store_sales
data for now.
Let's run through some examples
State Aggregation
select
store_sales.customer.state,
sum(store_sales.sales_price) as total_sales,
count(store_sales.customer.id) as customer_count,
total_sales / customer_count as average_sales_per_customer
order by
average_sales_per_customer desc
limit 10;
Mixed aggregate
select
--store_sales.customer.id,
store_sales.customer.first_name,
--store_sales.customer.state,
sum(store_sales.sales_price) as total_sales,
--sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
total_sales / total_state_sales as fraction_of_total_state_sales
order by
fraction_of_total_state_sales desc
limit 10;
Filtering
WHERE store_sales.customer_demographic.education_status = 'College'
SELECT
--store_sales.customer.id,
store_sales.customer.first_name,
store_sales.customer.state,
sum(store_sales.sales_price) as total_sales,
--sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
total_sales / total_state_sales as fraction_of_total_state_sales
order by
store_sales.customer.state asc,
fraction_of_total_state_sales desc
limit 10;
Nuanced Filter
SELECT
store_sales.customer.state,
store_sales.customer_demographic.education_status,
sum(filter store_sales.sales_price where store_sales.customer_demographic.education_status = 'College') as college_sales,
sum(store_sales.sales_price ? store_sales.customer_demographic.education_status = 'College') as college_sales_alt,
--sum(store_sales.sales_price ) by store_sales.customer.state as total_state_sales,
college_sales / total_state_sales as fraction_of_total_state_sales
HAVING
store_sales.customer_demographic.education_status = 'College'
order by
store_sales.customer.state asc,
fraction_of_total_state_sales desc
limit 10;
Find the average per-state rank by total store sales of items
with ranked_states as
select
store_sales.item.name,
store_sales.customer.state,
rank store_sales.item.name
over store_sales.customer.state
order by sum(store_sales.sales_price) by store_sales.item.name, store_sales.customer.state desc
as sales_rank;
select
ranked_states.store_sales.item.name,
avg(cast(ranked_states.sales_rank as int))-> avg_sales_rank,
max(cast(ranked_states.sales_rank as int))-> max_sales_rank,
min(cast(ranked_states.sales_rank as int))-> min_sales_rank
order by
avg_sales_rank desc
limit 10
;
Sandbox
Ready? Let's have you try out some queries!
TIP
For basic queries, Trilogy should be almost identical to SQL. When in doubt, try the SQL syntax!
Can you answer these questions? (click to show a possible answer)
- What were the sales by year in the state of CA? >
- What was the average yearly sales for the state of CA? >
- Which customer demographics had the most store sales in 2001, and what were the sales within that demographic in hawaii? >
- What was the most popular item bought by customers in massachesetts and kentucky by customers with more than 5 orders? >
The following concepts are predefined for you and can be referenced by name.
Concept Search
There's more concepts that we can reasonably show you available. Search this box for inspiration:
WARNING
Stick to querying concepts with the same root for now - eg store_sales.x and store_sales.y. Or if you're feeling adventerous, read on to find out how to merge models to enable cross-namespace querying.
Multiple Models
We've had some fun with store sales, but what about the rest of the dataset?
A typical Trilogy script will be based on importing existing models - and these models may themselves have imports. But if you have two models you want to connect, how can you do that?
There's two tools. Let's start with the one you generally don't want to use - a query merge.
Query Merges
Query merges look like selects, but with more than 1 select clause + an align statement that defines how to merge the selects. Think of this as a full outer join between the listed columns to merge the select outputs. The where, order by, and others are identical.
This can be useful to combine disparate concepts into a single result set row, such as show orders by ship date and delivery date in the same row.
TIP
Only use query merges when you are aligning conceptually unrelated topics into a tabular dataset.
SELECT 1
MERGE
SELECT 2
ALIGN <align_alias>:<select 1 concept>,<select 2 concept>
AND
<align_alias_2>:<select 1 concept>,<select 2 concept>
Model Merges
The preferred method is to merge the model on one or more concepts. This tells Trilogy that these two fields are the "same". For example, if you have a sales dataset and a holidays dataset, they might have the following fields:
Sales: 'order_date', 'ship_date', 'returned_date', 'order_id'
Holidays: 'date', '<>.holiday_name'
You want to know what sales were ordered on holidays - you would merge the holidays date into the sales.order_date, and you could now easily query select holiday_name, count(order_id)
.
If you wanted to see orders that shipped on a holiday, you'd merge them on ship date - and if you wanted to be able to query both, you could import the holidays dataset under two differet names and merge them independently.
Merge on one: MERGE <concept1> into <modifiers?><concept2>
Merge on many: MERGE <namespace1>.* into <modifiers?><namespace2>.*
Let's try both out:
Query Merge
SELECT
store_sales.date.year,
count(store_sales.ticket_number) as store_order_count
HAVING
store_order_count>0
MERGE
SELECT
web_sales.date.year,
count(web_sales.order_number) as web_order_count
HAVING
web_order_count>0
ALIGN
report_date: store_sales.date.year, web_sales.date.year
ORDER BY
report_date asc;
Model Merge
MERGE store_sales.date.* into ~date.*;
MERGE web_sales.date.* into ~date.*;
SELECT
date.year,
count(web_sales.order_number) as web_order_count,
count(store_sales.ticket_number) as store_order_count
HAVING
web_order_count>0 or store_order_count>0
ORDER BY
date.year asc
LIMIT 100;
TIP
Multiple merge statements can be defined between two models; queries will merge across as many concepts as are referenced in the query.
Saving Results / ETL
Imagine you want to create tables or save the outputs of a query to power a dashboard.
Trilogy supports this through the persist
keyword.
This keyword is used to signify that any table created by Trilogy is effectively a cache of a given output. You've already defined canonical sources; if we copy that data into a new table it's only valid until the sources change.
In base Trilogy, this can be used to create or update a table, such as on powering a dashboard.
TIP
In practice, most warehouse will have a finite set of 'roots' and a number of caches derived from them that are refreshed on some cadence to drive reporting, analytics, and performance. Trilogy allows you to explicitly define and manage those caches via persist statements.
Basic Persist
MERGE store_sales.customer.* into customer.*;
MERGE web_sales.customer.* into customer.*;
MERGE store_sales.date.* into store_sales.date.*;
MERGE web_sales.date.* into web_sales.date.*;
PERSIST customer_fact into customer_fact from
WHERE date.year = 2000
AND customer.state = 'VT'
SELECT
customer.id,
date.year,
sum(store_sales.sales_price) as total_store_sales,
sum(web_sales.sales_price) as total_web_sales,
total_store_sales + total_web_sales as total_sales
TrilogyT
Interested in using Trilogy for data transformation as well as data access?
TrilogyT is a Trilogy integration that adds additional keywords to support ETL workflows and integrates closely with DBT to support a full data warehouse workflow.
You can read more in the TrilogyT docs.
Benefits of A Semantic Layer
A semantic layer is a business representation of data and offers a unified and consolidated view of data across an organization.
Trilogy is a semantic layer - the abstract expression of relationships between concepts enables a consistent and coherent view of the data, and the separation of the model from the underlying data enables the model to evolve independently of the underlying data.
This makes Trilogy a natural fit for any case where you want to operate at this higher level of abstraction - a popular one these days being GenAI. It's much easier for a GenAI model to identify the appropriate concepts for a query and let Trilogy map that to the underlying data model; and as with a human written query this makes the result robust to changes/evolution of your data model.
Trilogy-NLP / GenAI
Trilogy-NLP is a Trilogy integration that adds natural language processing to Trilogy, enabling users to write queries in natural language and have them translated to SQL via the Trilogy semantic model. The simplified abstraction of a Trilogy model is natural fit for generative AI models, bypassing potential hallucination issues and SQL correctness challenges by focusing the model solely on interpreting the user query and mapping that to the higher level semantic layer.
WARNING
Trilogy-NLP is in active development and may not always return correct results. As with other generative AI tools, experimenting with what prompts produce the best results is recommended.
Trilogy-NLP can either be used in an interactive analytics context, as a helper for users to write queries, or as a backend for a chatbot or other natural language interface to a database.
You can experiment with querying the TPC-ds dataset with trilogy-NLP below.
TIP
You can view the generated query just like with Trilogy - Trilogy-NLP is just a different way to write the same queries, and it will always show it's work.
Want to Learn More?
For a deeper dive into the language and philosphy, head over to concepts.
For an assortment of related reading, head over to the blog page.
For the background on why Trilogy exists, head over to the thesis page.
Can't get enough of TPC-DS? Head over to the tpc-ds blog
For more details on a product, go to the product pages: