Demo - Exploring the Titanic

This Titanic demo is a simple example of how Trilogy syntax works. It uses the well-known Titanic dataset, a single table with the following fields about passengers:

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

Our First Queries

TIP

This demo uses an in-memory DuckDB database that may take some time to start up before your first query. Trilogy is database agnostic; the same syntax will work on any backend, such as Postgres, BigQuery, or Snowflake.

Basic Select

select passenger.name, passenger.id limit 5;

How many survived?

select 
    passenger.survived, 
    passenger.id.count, 
    count(passenger.id) as passenger_count_alt, 
    count(passenger.id) -> passenger_count_alt_2
;

The Model

Those queries could run because we have a model defined already; we'll dig into this more later. For now, just note that each concept is declared with a type, we have one derived metric, and there is a datasource defined that maps these concepts to a table.

TIP

A model describes the relationships between Trilogy concepts and the underlying data. It is a contract that can be used to generate queries.


key passenger.id int; #numeric identifier created for dataset
property passenger.id.age int?; # can be null, age of passenger
property passenger.id.survived int; # 1 = survived, 0 = died
property passenger.id.name string; # full name
property passenger.id.class int; # class of passenger, 1-3
property passenger.id.fare float; # the price paid by passenger
property passenger.id.cabin string?; # can be null, the cabin the passenger was in
property passenger.id.embarked bool;
property passenger.id.last_name <- split(passenger.name,',')[1];

metric passenger.id.count <- count(passenger.id);

datasource raw_data (
    passengerid:passenger.id,
    age:passenger.age,
    survived:passenger.survived,
    pclass:passenger.class,
    name:passenger.name,
    fare:passenger.fare,
    cabin:passenger.cabin,
    embarked:passenger.embarked,
)
grain (passenger.id)
address raw_titanic;

Derived Concepts and Filtering

TIP

last_name is defined in the model by splitting a passenger's name on the comma and taking the first element (names are formatted 'Last, Title. First'). We'll use this derived last_name as a rough family identifier in these queries; if you explore in the sandbox you'll find that's not quite right, but it's good enough for now.
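Outside of Trilogy, the same derivation can be sketched in plain Python (note Python lists are 0-indexed; the example name is illustrative, not taken from the demo data):

```python
# Sketch of the model's last_name derivation: split on the comma and
# keep the part before it, which is the family name in "Last, Title. First".
def last_name(name: str) -> str:
    return name.split(",")[0].strip()

print(last_name("Braund, Mr. Owen Harris"))  # Braund
```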

Let's run through some examples.

Family Sizing (By Last Name)

select 
    passenger.last_name, 
    count(passenger.id) as family_size
order by 
    family_size desc
limit 5;

How big were families in each class?

select 
    passenger.class,
    max(
        count(passenger.id) by passenger.last_name
    ) by passenger.class
     as max_family_size_per_class,
     avg(
        count(passenger.id) by passenger.last_name
    ) by passenger.class
    -> avg_family_size_per_class, 
    
order by 
    passenger.class asc
;

How many survived from each family (by last name)?

auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
select 
    passenger.last_name,
    passenger.id.count,
    count(surviving_passenger) as surviving_size
order by
    passenger.id.count desc
limit 5;
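The filter-then-count pattern above (total passengers per family alongside only the surviving ones) can be sketched in ordinary Python; the rows here are made up for illustration, not the real demo data:

```python
from collections import Counter

# Hypothetical (last_name, survived) rows standing in for the demo table.
rows = [
    ("Andersson", 0), ("Andersson", 1), ("Andersson", 0),
    ("Carter", 1), ("Carter", 1),
]

# count(passenger.id) per last_name: every row counts.
family_size = Counter(last for last, _ in rows)
# count(surviving_passenger): only rows passing the filter count.
surviving_size = Counter(last for last, survived in rows if survived == 1)

print(family_size["Andersson"], surviving_size["Andersson"])  # 3 1
```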

Families (by last name) where everyone survived

auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
select 
    passenger.last_name,
    passenger.id.count,
    count(surviving_passenger) as surviving_size
where
    passenger.id.count = surviving_size
order by
    passenger.id.count desc
limit 5;

Get the first name, last name, and age of the oldest survivor in each family

with survivors as
SELECT
    passenger.last_name, 
    passenger.name,
    passenger.id, 
    passenger.survived,
    passenger.age,  
where 
    passenger.survived = 1;

# now we can reference our rowset derived concepts like any other concept
select 
    --survivors.passenger.id,
    survivors.passenger.name,
    survivors.passenger.last_name,
    survivors.passenger.age,
    --row_number survivors.passenger.id over 
        survivors.passenger.last_name
        order by survivors.passenger.age desc 
    as eldest
where 
    eldest = 1
order by survivors.passenger.last_name desc
limit 5;
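The windowed row_number logic above (keep the row ranked first per family when ordered by age descending) can be sketched in Python over hypothetical survivor rows:

```python
from itertools import groupby

# Hypothetical (last_name, name, age) survivor rows, not the real data.
survivors = [
    ("Carter", "Carter, Mrs. Lucile", 36.0),
    ("Carter", "Carter, Miss. Lucile P", 14.0),
    ("Becker", "Becker, Mrs. Nellie", 36.0),
]

# Sort by family, then age descending, and take the first row per family:
# the equivalent of row_number() = 1 with the same partition and ordering.
survivors.sort(key=lambda r: (r[0], -r[2]))
eldest = {last: next(rows) for last, rows in groupby(survivors, key=lambda r: r[0])}

print(eldest["Carter"][1])  # Carter, Mrs. Lucile
```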

Sandbox

Now, try writing your own queries in the sandbox below.

The query box is stateful; a successful query (such as adding a concept) will be saved and can be referenced in future queries.

TIP

For basic queries, Trilogy should be almost identical to SQL. When in doubt, try the SQL syntax!

Can you answer these questions?

  • Did different classes have different average fares?
  • Were people in higher classes more likely to survive?
  • Were certain ages more likely to survive?
  • What was the average family (assume one last name is one family) survival rate in each class?
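The shape of one of these answers (survival rate by class) can be sketched outside Trilogy with made-up rows; only the aggregation logic, not the data, carries over:

```python
from collections import defaultdict

# Hypothetical (pclass, survived) rows, not the real dataset.
rows = [(1, 1), (1, 1), (1, 0), (3, 0), (3, 0), (3, 1)]

totals = defaultdict(lambda: [0, 0])  # class -> [survivors, passengers]
for pclass, survived in rows:
    totals[pclass][0] += survived
    totals[pclass][1] += 1

# survival rate = survivors / passengers, per class
rates = {c: s / n for c, (s, n) in totals.items()}
print(round(rates[1], 2))  # 0.67
```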

The following concepts are predefined for you and can be referenced by name.

Available Concepts

passenger.id, passenger.age, passenger.survived, passenger.name, passenger.class, passenger.fare, passenger.cabin, passenger.embarked

    Multiple Tables

    You've been able to do some initial analysis on the Titanic dataset, but now your data engineer has gotten excited about someone named Kimball. They've refactored your dataset to normalize it, and now you have the following tables.

    • fact_titanic
    • dim_class
    • dim_passenger

    Let's see how we can use Trilogy to query this new dataset.

    Basic Select

    select passenger.name, passenger.id limit 5;

    How many survived?

    select 
        passenger.survived, 
        passenger.id.count, 
        count(passenger.id) as passenger_count_alt, 
        count(passenger.id) -> passenger_count_alt_2
    ;

    Family Sizing (By Last Name)

    select 
        passenger.last_name, 
        count(passenger.id) as family_size
    order by 
        family_size desc
    limit 5;

    How big were families in each class?

    select 
        passenger.class,
        max(
            count(passenger.id) by passenger.last_name
        ) by passenger.class
         as max_family_size_per_class,
         avg(
            count(passenger.id) by passenger.last_name
        ) by passenger.class
        -> avg_family_size_per_class, 
        
    order by 
        passenger.class asc
    ;

    How many survived from each family (by last name)?

    auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
    select 
        passenger.last_name,
        passenger.id.count,
        count(surviving_passenger) as surviving_size
    order by
        passenger.id.count desc
    limit 5;

    Families (by last name) where everyone survived

    auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
    select 
        passenger.last_name,
        passenger.id.count,
        count(surviving_passenger) as surviving_size
    where
        passenger.id.count = surviving_size
    order by
        passenger.id.count desc
    limit 5;

    Get the first name, last name, and age of the oldest survivor in each family

    with survivors as
    SELECT
        passenger.last_name, 
        passenger.name,
        passenger.id, 
        passenger.survived,
        passenger.age,  
    where 
        passenger.survived = 1;
    
    # now we can reference our rowset derived concepts like any other concept
    select 
        --survivors.passenger.id,
        survivors.passenger.name,
        survivors.passenger.last_name,
        survivors.passenger.age,
        --row_number survivors.passenger.id over 
            survivors.passenger.last_name
            order by survivors.passenger.age desc 
        as eldest
    where 
        eldest = 1
    order by survivors.passenger.last_name desc
    limit 5;

    This should look pretty familiar. What's going on? Take a look at the queries being generated by clicking the 'sql' tab to confirm that something has changed.

    The Model

    Let's look at our new model.

    key passenger.id int;
    key _class_id int;
    property passenger.id.age int?;
    property passenger.id.survived bool;
    property passenger.id.name string;
    property passenger.id.fare float;
    property passenger.id.cabin string?;
    property passenger.id.embarked int;
    property passenger.last_name <- index_access(split(passenger.name,','),1);
    
    datasource dim_passenger (
        id:passenger.id,
    	age:passenger.age,
    	name:passenger.name,
    	last_name:passenger.last_name
        )
    grain (passenger.id)
    address dim_passenger;
    
    datasource fact_titanic (
        passengerid:passenger.id,
    	survived:passenger.survived,
    	class_id:_class_id,
    	fare:passenger.fare,
    	cabin:passenger.cabin,
    	embarked:passenger.embarked
        )
    grain (passenger.id)
    address fact_titanic;
    
    datasource dim_class (
        id:_class_id,
    	class:passenger.class
        )
    grain (_class_id)
    address dim_class;
    
    
    

    Note that our concepts are almost unchanged. We've added a new _class_id concept to capture the new surrogate key added to dim_class.

    We've also changed the datasource definitions to point to the new tables.

    But as a consumer, you haven't needed to change a thing. We are still encoding the same logical concepts, and the semantic relationships between them have not changed.

    This is the convenience of separating the query language from the data model. The data model expresses a contract that can be evolved independently of the underlying materialized database tables, enabling transparent refactoring, aggregation, and remodeling to reflect the changing needs of the business.

    Multi-Table Sandbox

    Try querying the new model in the sandbox below.

      Extending the Model

      A typical Trilogy script will be based on importing existing models.

      import passengers as passenger;
      

      Often, those models may themselves contain imports, which together can create a complex model.

      A passenger model may import a family model, for example, enabling nested references like the one below.

      import passengers as passenger;
      
      select passenger.family.last_name, passenger.family.size;
      
      

      For our titanic model, imagine if someone else had set up a reference model of the richest families in the world.

      We want to merge this family information in with our titanic model.

      There are two approaches we can use; one is to merge the results of two models within a single query, and one is to merge the two models on a shared concept so that we can build many dynamic queries against them.

      Query Merges

      Query merges look like selects, but with more than one select clause plus an align statement that defines how to merge the selects. Think of this as a full outer join between the listed columns to merge the select outputs. The where, order by, and other clauses are identical.

      This can be useful for combining disparate concepts into a single result set row, such as showing orders by ship date and delivery date in the same row.

      SELECT 1 
      MERGE
      SELECT 2
      ALIGN <align_alias>:<select 1 concept>,<select 2 concept>
      AND 
      <align_alias_2>:<select 1 concept>,<select 2 concept>
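
The alignment behaves like a full outer join on the aligned columns; here is a minimal Python sketch of that semantics, using made-up result sets keyed by last name:

```python
# Hypothetical outputs of two selects, keyed by the aligned concept.
select_1 = {"Astor": 2, "Brown": 4}                     # family_count
select_2 = {"Astor": 87_000_000, "Gould": 50_000_000}   # net worth

# ALIGN merges on the shared key like a full outer join: every key from
# either side appears once, with None where one side has no match.
aligned = {
    key: (select_1.get(key), select_2.get(key))
    for key in sorted(set(select_1) | set(select_2))
}

print(aligned["Gould"])  # (None, 50000000)
```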
      

      Model Merges

      Model merges are more straightforward syntactically. Conceptually, they represent that the two concepts are synonymous but may not share all the same values; they can be used to bridge concept references across two models.

      MERGE <concept1> into <modifiers?><concept2>

      Let's try both out:

      Query Merge

      # this is already imported, commented for clarity
      # import rich_families as rich_info; 
      
      SELECT
          passenger.last_name,
          count(passenger.id) -> family_count
      MERGE
      SELECT
          rich_info.last_name,
          rich_info.net_worth_1918_dollars
      ALIGN 
          join_last_name:passenger.last_name, rich_info.last_name
      WHERE 
          rich_info.net_worth_1918_dollars is not null
          AND passenger.last_name is not null
      ORDER BY
          family_count desc;

      Model Merge

      # this is already imported, commented for clarity
      # import rich_families as rich_info; 
      MERGE rich_info.last_name into ~passenger.last_name;
      
      SELECT
          passenger.last_name,
          count(passenger.id)-> family_count,
          rich_info.net_worth_1918_dollars,
      WHERE
          rich_info.net_worth_1918_dollars is not null
          and passenger.last_name is not null
      ORDER BY 
          family_count desc ;

      TIP

      Multiple merge statements can be defined between two models; queries will merge across as many concepts as are referenced in the query.

      Saving Results / ETL

      Imagine you want to create tables or save the outputs of a query to power a dashboard.

      Trilogy supports this through the persist keyword.

      This keyword is used to signify that any table created by Trilogy is effectively a cache of a given output. You've already defined canonical sources; if we copy that data into a new table it's only valid until the sources change.

      In base Trilogy, this can be used to create or update a table, such as one powering a dashboard.

      TIP

      In practice, most warehouses will have a finite set of 'roots' and a number of caches derived from them, refreshed on some cadence to drive reporting, analytics, and performance. Trilogy enables you to explicitly define and manage those caches via persist statements.

      The first query here shows a persist command; the second shows how the generated query will reference the persisted value.

      Basic Persist

      
      property passenger.id.split_cabin <- unnest(split(passenger.cabin, ' '));
      persist cabin_info into dim_cabins from 
      select 
          passenger.id, 
          passenger.split_cabin;
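
The unnest(split(...)) derivation turns one multi-cabin string into one row per cabin code; a Python sketch of that flattening, with hypothetical cabin values:

```python
# Hypothetical (passenger_id, cabin) rows; a cabin field may hold
# several space-separated codes.
rows = [(1, "C23 C25 C27"), (2, "E46")]

# split on spaces, then unnest: one output row per individual cabin code.
unnested = [(pid, code) for pid, cabin in rows for code in cabin.split(" ")]

print(unnested)  # [(1, 'C23'), (1, 'C25'), (1, 'C27'), (2, 'E46')]
```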

      Query Our Persisted Table

      select passenger.id, passenger.split_cabin limit 20;

      TrilogyT

      TrilogyT is a Trilogy integration that adds additional keywords to support ETL workflows and integrates closely with dbt to support a full data warehouse workflow.

      You can try it in the TrilogyT demo or read more about it in the docs.

      Benefits of A Semantic Layer

      A semantic layer is a business representation of data and offers a unified and consolidated view of data across an organization.

      Trilogy is a semantic layer - the abstract expression of relationships between concepts enables a consistent and coherent view of the data, and the separation of the model from the underlying data enables the model to evolve independently of the underlying data.

      This makes Trilogy a natural fit for any case where you want to operate at this higher level of abstraction - a popular one these days being GenAI. It's much easier for a GenAI model to identify the appropriate concepts for a query and let Trilogy map them to the underlying data model; as with a human-written query, this makes the result robust to evolution of your data model.

      Trilogy-NLP / GenAI

      Trilogy-NLP is a Trilogy integration that adds natural language processing, enabling users to write queries in natural language and have them translated to SQL via the Trilogy semantic model. The simplified abstraction of a Trilogy model is a natural fit for generative AI models, bypassing potential hallucination issues and SQL correctness challenges by focusing the model solely on interpreting the user query and mapping it to the higher-level semantic layer.

      WARNING

      Trilogy-NLP is in active development and may not always return correct results. As with other generative AI tools, experimenting with what prompts produce the best results is recommended.

      Trilogy-NLP can either be used in an interactive analytics context, as a helper for users to write queries, or as a backend for a chatbot or other natural language interface to a database.

      You can experiment with querying the titanic dataset we've defined using Trilogy-NLP below.

      TIP

      You can view the generated query just like with Trilogy - Trilogy-NLP is just a different way to write the same queries, and it will always show its work.

        TPC-DS Comparisons

        TPC-DS is a common database benchmarking tool that models a retail data warehouse. Ensuring that we can accurately replicate queries against the TPC-DS schema is a good test of the expressive power of Trilogy. A selected comparison of where Trilogy and SQL differ is below.

        Typically, Trilogy-generated SQL will perform worse than hand-optimized SQL. Trilogy optimizes for speed of authoring, expressiveness, and abstraction first, and performance second, though performance must remain reasonable enough not to be a blocker.

        WARNING

        This section is a work in progress and will be expanded over time.

        Query 3 - Brand Sales

        This is a fairly straightforward query in both languages.

        SQL

        SELECT dt.d_year,
               item.i_brand_id brand_id,
               item.i_brand brand,
               sum(ss_ext_sales_price) sum_agg
        FROM date_dim dt,
             store_sales,
             item
        WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
          AND store_sales.ss_item_sk = item.i_item_sk
          AND item.i_manufact_id = 128
          AND dt.d_moy=11
        GROUP BY dt.d_year,
                 item.i_brand,
                 item.i_brand_id
        ORDER BY dt.d_year,
                 sum_agg DESC,
                 brand_id
        LIMIT 100;

        Trilogy

        Here we use an inline filter definition to push down the sum filtering, since there is only a single concept.

        import store_sales as store_sales;
        SELECT
            store_sales.date.year,
            store_sales.item.brand_id,
            store_sales.item.brand_name,
            sum(filter store_sales.ext_sales_price 
                    where store_sales.date.month_of_year=11 
                    and store_sales.item.manufacturer_id=128
                )->sum_agg
        WHERE 
            sum_agg
        ORDER BY 
            store_sales.date.year asc,
            sum_agg desc,
            store_sales.item.brand_id asc
        limit 100;

        Query 1 - Customer Returns

        This is a more complicated query - both use a CTE to calculate a set of values, then refine it.

        SQL

        WITH customer_total_return AS
          (SELECT sr_customer_sk AS ctr_customer_sk,
                  sr_store_sk AS ctr_store_sk,
                  sum(sr_return_amt) AS ctr_total_return
           FROM store_returns,
                date_dim
           WHERE sr_returned_date_sk = d_date_sk
             AND d_year = 2000
           GROUP BY sr_customer_sk,
                    sr_store_sk)
        SELECT c_customer_id
        FROM customer_total_return ctr1,
             store,
             customer
        WHERE ctr1.ctr_total_return >
            (SELECT avg(ctr_total_return)*1.2
             FROM customer_total_return ctr2
             WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
          AND s_store_sk = ctr1.ctr_store_sk
          AND s_state = 'TN'
          AND ctr1.ctr_customer_sk = c_customer_sk
        ORDER BY c_customer_id
        LIMIT 100;

        Trilogy

        import store_returns as returns;
        
        with tn_2000 AS select
            returns.customer.text_id,
            returns.store.id,
            sum(returns.return_amount)-> total_returns
        where
            returns.return_date.year = 2000
            and returns.store.state = 'TN';
        
        auto avg_store_returns <- avg(tn_2000.total_returns) by tn_2000.returns.store.id;
        
        select
            tn_2000.returns.customer.text_id,
            --tn_2000.total_returns, 
            --avg_store_returns,
        where
            tn_2000.total_returns > 
            (1.2*avg_store_returns)
        order by 
            tn_2000.returns.customer.text_id asc
        limit 100;