Demo - Exploring the TPC-DS Dataset
This demo uses the popular TPC-DS dataset, often used to benchmark performance for decision-support SQL.
We’ll use it a little differently here. DuckDB can natively generate a representative data warehouse from TPC-DS using a built-in extension, and we’ll use that to explore Trilogy syntax. You can read more about Trilogy and our use of this benchmark here.
Tips
This demo uses an in-memory DuckDB database that may take a moment to start for your first query. Trilogy is database-agnostic—the same syntax works on any backend such as Postgres, BigQuery, or Snowflake.
Our First Queries
TPC-DS provides a nice tidy warehouse: 17 dimension tables and 7 fact tables.
That’s a lot to work with! Fortunately, someone has already defined a Trilogy model for TPC-DS here, and we’ll be using that directly.
Tips
Can't wait? Hop over to this overview to read more about the basic semantic model.
Let’s focus on Trilogy imports. These typically map closely to fact tables—by design. Queries often center on fact tables, so they’re natural entry points for modeling.
We’ll hide the import statements from the queries shown here—you can assume they’re already loaded into the environment.
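For reference, here is a hedged sketch of what those hidden imports might look like (the exact import paths are an assumption based on the model's file names):

# assumed imports for the queries in this demo
import customer as customer;
import store_sales as store_sales;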
Tips
In the sections below, you can run real Trilogy queries. Click the pulsing 'run' button below each query to see results and the generated SQL.
Basic Select
select
    customer.id,
    customer.full_name,
limit 5;
Another View, Similar Data
select
    store_sales.customer.id,
    store_sales.customer.full_name,
    store_sales.ticket_number,
limit 5;
Sales By Customer
select
    store_sales.customer.id,
    store_sales.customer.first_name,
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales
order by
    total_sales desc
limit 10;
The Model
Earlier we mentioned a model that allows us to run these queries. Let’s take a closer look.
Modeling is the heart and art of Trilogy—it’s where we define names and relationships that enable intuitive access to data.
Models define and bind concepts, and they often import other models. They can be 1:1 with tables, but don’t have to be—which is useful for refactoring, extending, or simplifying logic.
Tips
A model defines the relationship between Trilogy concepts and the underlying data. It acts as a contract to generate queries.
Tips
Trilogy uses the same language for querying and modeling. You can define or extend a model as part of an ad hoc workflow—just like SQL.
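For example, here's a hedged sketch of extending the model mid-session; the derived concept is illustrative, not part of the shipped model:

# hypothetical ad hoc extension: derive a new concept, then query it immediately
auto discount_ratio <- store_sales.ext_discount_amount / store_sales.ext_list_price;

select
    store_sales.customer.state,
    avg(discount_ratio) as avg_discount_ratio
order by
    avg_discount_ratio desc
limit 5;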
Here’s the store_sales model we’ve been querying from. The full TPC-DS model includes many such files.
import item as item;
import date as date;
import date as return_date;
import time as time;
import time as return_time;
import customer as customer;
import customer as return_customer;
import promotion as promotion;
import customer_demographic as customer_demographic;
import store as store;
import store as return_store;
# you can interleave modeling and queries;
select 1 -> example;
key ticket_number int;
property <ticket_number,item.id>.quantity int;
property <ticket_number,item.id>.sales_price float;
property <ticket_number,item.id>.list_price float;
property <ticket_number,item.id>.ext_sales_price float;
property <ticket_number,item.id>.ext_wholesale_cost float;
property <ticket_number,item.id>.ext_list_price float;
property <ticket_number,item.id>.ext_discount_amount float;
property <ticket_number,item.id>.coupon_amt float;
property <ticket_number,item.id>.net_profit float;
property <ticket_number,item.id>.is_returned bool;
property <ticket_number,item.id>.net_paid float;
property <ticket_number,item.id>.return_amount float?;
property <ticket_number,item.id>.return_net_loss float?;
# and derive calculated metrics
auto profit <- ext_list_price - ext_wholesale_cost - ext_discount_amount + ext_sales_price;
# bind defined values to a table
datasource store_sales (
    SS_SOLD_DATE_SK: date.id,
    SS_SOLD_TIME_SK: time.id,
    SS_CUSTOMER_SK: customer.id,
    SS_CDEMO_SK: customer_demographic.id,
    SS_TICKET_NUMBER: ticket_number,
    SS_ITEM_SK: item.id,
    SS_SALES_PRICE: sales_price,
    SS_LIST_PRICE: list_price,
    SS_EXT_SALES_PRICE: ext_sales_price,
    SS_EXT_LIST_PRICE: ext_list_price,
    SS_EXT_WHOLESALE_COST: ext_wholesale_cost,
    SS_EXT_DISCOUNT_AMT: ext_discount_amount,
    SS_NET_PROFIT: net_profit,
    SS_PROMO_SK: promotion.id,
    SS_QUANTITY: quantity,
    SS_COUPON_AMT: coupon_amt,
    SS_STORE_SK: store.id,
    SS_NET_PAID: net_paid,
)
# the grain is used to resolve queries appropriately
grain (ticket_number, item.id)
address memory.store_sales;
# multiple tables can be used to source metrics
datasource store_returns (
    SR_RETURNED_DATE_SK: return_date.id,
    SR_RETURN_TIME_SK: return_time.id,
    SR_ITEM_SK: ~item.id,
    SR_CUSTOMER_SK: return_customer.id,
    SR_RETURN_AMT: return_amount,
    SR_TICKET_NUMBER: ~ticket_number,
    SR_STORE_SK: return_store.id,
    SR_NET_LOSS: return_net_loss,
    # capture if there is a row in this table
    bool(return_time.id): is_returned,
)
grain (ticket_number, item.id)
address memory.store_returns;
If you look at the top of the file, there are imports. These are the key to reuse in Trilogy: they enable composition of models.
Returns by Different State
where store_sales.return_customer.state is not null
    and store_sales.return_customer.state != store_sales.customer.state
select
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    store_sales.return_customer.state,
order by
    total_sales desc
limit 10;
In fact, if we drill into customers, we'll see that the model itself has an import. That entire extended model has been imported twice: once for the customer who bought, and once for the customer who returned, enabling the same logic to be reused across both keys.
Beyond reuse, we want abstraction. The key is that datasources - which make the models concrete - can be evolved and split transparently to a consuming query. We can even use this to dynamically swap in aggregates when a higher-level table with the same outputs has been computed (such as when a session runs a persist to store a query output).
import customer_demographic as demographics;
key id int;
property id.text_id string;
property id.last_name string;
property id.first_name string;
property id.preferred_cust_flag string;
property id.birth_day int;
property id.birth_month int;
property id.birth_year int;
property id.birth_country string;
property id.salutation string;
property id.email_address string;
property id.login string;
property id.last_review_date string;
# avoid having to recalculate this in queries
property id.full_name <- concat(salutation, ' ', first_name, ' ', last_name);
property id.birth_date <- cast(
    concat(
        cast(birth_year as string), '/',
        cast(birth_month as string), '/',
        cast(birth_day as string)
    ) as date
);
key address_id int;
property address_id.address_id_string string;
property address_id.street string;
property address_id.city string;
property address_id.state string?;
property address_id.zip string;
property address_id.county string;
property address_id.country string;
datasource customers (
    C_CUSTOMER_SK: id,
    C_CUSTOMER_ID: text_id,
    C_LAST_NAME: last_name,
    C_FIRST_NAME: first_name,
    C_CURRENT_ADDR_SK: address_id,
    C_CURRENT_CDEMO_SK: demographics.id,
    C_PREFERRED_CUST_FLAG: preferred_cust_flag,
    C_BIRTH_COUNTRY: birth_country,
    C_SALUTATION: salutation,
    C_EMAIL_ADDRESS: email_address,
    C_BIRTH_DAY: birth_day,
    C_BIRTH_MONTH: birth_month,
    C_BIRTH_YEAR: birth_year,
    C_LOGIN: login,
    C_LAST_REVIEW_DATE_SK: last_review_date,
)
grain (id)
address memory.customer;
datasource customer_address (
    CA_ADDRESS_SK: address_id,
    CA_STREET_NAME: street,
    CA_CITY: city,
    CA_STATE: state, # two-character state code, e.g. CA for California, MA for Massachusetts
    CA_ZIP: zip,
    CA_COUNTY: county,
    CA_COUNTRY: country,
)
grain (address_id)
address memory.customer_address;
Derived Concepts and Filtering
Let’s dig into derivation, aggregation, and filtering—starting with store_sales—and run through some examples.
State Aggregation
select
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    count(store_sales.customer.id) as customer_count,
    total_sales / customer_count as average_sales_per_customer
order by
    average_sales_per_customer desc
limit 10;
Mixed aggregate
select
    store_sales.customer.id,
    store_sales.customer.first_name,
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
    total_sales / total_state_sales as fraction_of_total_state_sales
order by
    fraction_of_total_state_sales desc
limit 10;
Filtering
WHERE store_sales.customer_demographic.education_status = 'College'
SELECT
    store_sales.customer.id,
    store_sales.customer.first_name,
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
    total_sales / total_state_sales as fraction_of_total_state_sales
order by
    store_sales.customer.state asc,
    fraction_of_total_state_sales desc
limit 10;
Nuanced Filter
SELECT
    store_sales.customer.state,
    store_sales.customer_demographic.education_status,
    sum(filter store_sales.sales_price where store_sales.customer_demographic.education_status = 'College') as college_sales,
    sum(store_sales.sales_price ? store_sales.customer_demographic.education_status = 'College') as college_sales_alt,
    sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
    college_sales / total_state_sales as fraction_of_total_state_sales
HAVING
    store_sales.customer_demographic.education_status = 'College'
order by
    store_sales.customer.state asc,
    fraction_of_total_state_sales desc
limit 10;
Average Per-State Sales Rank of Items
with ranked_states as
select
    store_sales.item.name,
    store_sales.customer.state,
    rank store_sales.item.name
        over store_sales.customer.state
        order by sum(store_sales.sales_price) by store_sales.item.name, store_sales.customer.state desc
    as sales_rank;

select
    ranked_states.store_sales.item.name,
    avg(cast(ranked_states.sales_rank as int)) -> avg_sales_rank,
    max(cast(ranked_states.sales_rank as int)) -> max_sales_rank,
    min(cast(ranked_states.sales_rank as int)) -> min_sales_rank
order by
    avg_sales_rank desc
limit 10;
Sandbox
Ready? Let's have you try out some queries!
Tips
For basic queries, Trilogy should be almost identical to SQL. When in doubt, try the SQL syntax!
Can you answer these questions? (Click each to show a possible answer; a sample sketch for the first one follows the list.)
- What were the sales by year in the state of CA?
- What was the average yearly sales for the state of CA?
- Which customer demographics had the most store sales in 2001, and what were the sales within that demographic in Hawaii?
- What was the most popular item bought in Massachusetts and Kentucky by customers with more than 5 orders?
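For example, here is one hedged sketch of an answer to the first question, using only concepts from the store_sales model (assuming store_sales.date is the sale date):

# a possible answer to 'sales by year in the state of CA'
WHERE store_sales.customer.state = 'CA'
SELECT
    store_sales.date.year,
    sum(store_sales.sales_price) as total_sales
ORDER BY
    store_sales.date.year asc;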
The following concepts are predefined for you and can be referenced by name.
Concept Search
There are more concepts available than we can reasonably show. Search this box for inspiration:
Warning
Stick to querying concepts with the same root for now - e.g. store_sales.x and store_sales.y. Or, if you're feeling adventurous, read on to find out how to merge models to enable cross-namespace querying.
Multiple Models
We've had some fun with store sales, but what about the rest of the dataset?
A typical Trilogy script will be based on importing existing models - and these models may themselves have imports. But if you have two models you want to connect, how can you do that?
There are two tools. Let's start with the one you generally don't want to use - a query merge.
Query Merges
Query merges look like selects, but with more than one select clause plus an align statement that defines how to merge the selects. Think of this as a full outer join between the listed columns, merging the select outputs. The where, order by, and other clauses work as they do in a normal select.
This can be useful for combining disparate concepts into a single result row, such as showing orders by ship date and delivery date in the same row.
Tips
Only use query merges when you are aligning conceptually unrelated topics into a tabular dataset.
SELECT 1
MERGE
SELECT 2
ALIGN
    <align_alias>: <select 1 concept>, <select 2 concept>
AND
    <align_alias_2>: <select 1 concept>, <select 2 concept>
Model Merges
The preferred method is to merge the model on one or more concepts. This tells Trilogy that these two fields are the "same". For example, if you have a sales dataset and a holidays dataset, they might have the following fields:
Sales: 'order_date', 'ship_date', 'returned_date', 'order_id'
Holidays: 'date', 'holiday_name'
You want to know what sales were ordered on holidays: you would merge the holiday date into sales.order_date, and you could then easily query select holiday_name, count(order_id).
If you wanted to see orders that shipped on a holiday, you'd merge on ship date instead - and if you wanted to be able to query both, you could import the holidays dataset under two different names and merge them independently.
Merge on one: MERGE <concept1> into <modifiers?><concept2>
Merge on many: MERGE <namespace1>.* into <modifiers?><namespace2>.*
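For instance, here is a hedged sketch of the holidays example above (the import paths and field names are hypothetical):

# hypothetical models from the example above
import sales as sales;
import holidays as holiday;

# declare the holiday calendar date and the order date to be the same concept
MERGE holiday.date into ~sales.order_date;

SELECT
    holiday.holiday_name,
    count(sales.order_id) as orders_on_holidays;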
Let's try both out:
Query Merge
SELECT
    store_sales.date.year,
    count(store_sales.ticket_number) as store_order_count
HAVING
    store_order_count > 0
MERGE
SELECT
    web_sales.date.year,
    count(web_sales.order_number) as web_order_count
HAVING
    web_order_count > 0
ALIGN
    report_date: store_sales.date.year, web_sales.date.year
ORDER BY
    report_date asc;
Model Merge
MERGE store_sales.date.* into ~date.*;
MERGE web_sales.date.* into ~date.*;

SELECT
    date.year,
    count(web_sales.order_number) as web_order_count,
    count(store_sales.ticket_number) as store_order_count
HAVING
    web_order_count > 0 or store_order_count > 0
ORDER BY
    date.year asc
LIMIT 100;
Tips
Multiple merge statements can be defined between two models; queries will merge across as many concepts as are referenced in the query.
Saving Results / ETL
Imagine you want to create tables or save the outputs of a query to power a dashboard.
Trilogy supports this through the persist keyword.
This keyword signifies that any table created by Trilogy is effectively a cache of a given output. You've already defined the canonical sources; if you copy that data into a new table, it's only valid until the sources change.
In base Trilogy, this can be used to create or update a table, such as one powering a dashboard.
Tips
In practice, most warehouses will have a finite set of 'roots' and a number of caches derived from them, refreshed on some cadence to drive reporting, analytics, and performance. Trilogy allows you to explicitly define and manage those caches via persist statements.
Basic Persist
MERGE store_sales.customer.* into customer.*;
MERGE web_sales.customer.* into customer.*;
MERGE store_sales.date.* into date.*;
MERGE web_sales.date.* into date.*;

PERSIST customer_fact into customer_fact from
WHERE
    date.year = 2000
    AND customer.state = 'VT'
SELECT
    customer.id,
    date.year,
    sum(store_sales.sales_price) as total_store_sales,
    sum(web_sales.sales_price) as total_web_sales,
    total_store_sales + total_web_sales as total_sales
;
TrilogyT
Interested in using Trilogy for data transformation as well as data access?
TrilogyT is a Trilogy integration that adds additional keywords to support ETL workflows and integrates closely with DBT to support a full data warehouse workflow.
You can read more in the TrilogyT docs.
Benefits of a Semantic Layer
A semantic layer is a business representation of data, offering a unified and consolidated view of data across an organization.
Trilogy is a semantic layer: the abstract expression of relationships between concepts enables a consistent and coherent view of the data, and the separation of the model from the underlying tables lets the model evolve independently.
This makes Trilogy a natural fit for any case where you want to operate at this higher level of abstraction - a popular one these days being GenAI. It's much easier for a GenAI model to identify the appropriate concepts for a query and let Trilogy map them to the underlying data model; as with a human-written query, this makes the result robust to changes and evolution of your data model.
Trilogy-NLP / GenAI
Trilogy-NLP is a Trilogy integration that adds natural language processing, enabling users to write queries in natural language and have them translated to SQL via the Trilogy semantic model. The simplified abstraction of a Trilogy model is a natural fit for generative AI models, bypassing potential hallucination issues and SQL correctness challenges by focusing the model solely on interpreting the user query and mapping it to the higher-level semantic layer.
Warning
Trilogy-NLP is in active development and may not always return correct results. As with other generative AI tools, experimenting with what prompts produce the best results is recommended.
Trilogy-NLP can be used in an interactive analytics context, as a helper for users to write queries, or as a backend for a chatbot or other natural language interface to a database.
You can experiment with querying the TPC-DS dataset with Trilogy-NLP below.
Tips
You can view the generated query just like with Trilogy - Trilogy-NLP is just a different way to write the same queries, and it will always show its work.
Want to Learn More?
For a deeper dive into the language and philosophy, head over to concepts.
For an assortment of related reading, head over to the blog page.
For the background on why Trilogy exists, head over to the thesis page.
Can't get enough of benchmarks? Head over to the TPC-DS blog.
For more details on the Trilogy ecosystem, go to the product pages: