Demo - Exploring the TPC-DS Dataset

This demo uses the popular TPD-DS dataset, typically used for benchmarking capability/performance for transaction workloads.

No benchmarking right now, though - we just care that DuckDB can natively produce a nice representative data warehouse for TPC-DS with a built in extension, and that this dataset gives us a good way to explore the language syntax. You can read more about Trilogy and our use of this benchmark tool here hereopen in new window.

TIP

This demo uses an in-memory DuckDB database that may take some time to start for your first query. Trilogy is database agnostic; the same syntax will work on any backend such as Postgres, Bigquery, or Snowflake.

Our First Queries

TPC-DS creates a nice small warehouse; 17 dimension tables and 7 fact tables.

Dimensions

call_centercatalog_pagecustomercustomer_addresscustomer_demographicsdate_dimhousehold_demographicsincome_banditempromotionreasonship_modestoretime_dimwarehouseweb_pageweb_site

Fact Tables

catalog_salescatalog_returnsinventorystore_salesstore_returnsweb_salesweb_returns

That's a lot to work with! Fortunately, someone has already defined a Trilogy model for TPC-DS hereopen in new window; we'll be using that directly. We'll discuss models in more detail in a little bit.

To start out, let's focus on these Trilogy imports. You'll notice that these map closely to fact tables; that's not a coincidence. It's typical for a user query to be centered on one or more facts, so those are natural places to start a modeling hierarchy. Trilogy imports let you expose many entrypoints into the same relationship of tables to support intuitive query patterns.

However, we're going to hide the import statements from all the queries we run here. We'll work in an environment with all of them loaded already!

customerstore_salesweb_salescatalog_sales

TIP

In the below section, you have the opportunity to run real Trilogy queries. Click the pulsing 'run' button below each query to see results and SQL.

Basic Select

A basic select from our customer import to see what our names and IDs look like. Click the SQL tab to see the generated SQL.
select 
    customer.id, 
    customer.full_name,
limit 5;

Another View, Similar Data

The semantic model encodes relationships and supports imports; a hierarchical selection like this - "store_sales.customer" is referencing the customer model imported and related to the store_sales model. Click through to the SQL after running this query and you'll see that you've just done your first join between two tables - congratulations!
select 
    store_sales.customer.id, 
    store_sales.customer.full_name,
    store_sales.ticket_number, 
limit 5;

Sales By Customer

This becomes more relevant as we start to pull in fact information. Just like SQL, you can define new information in a query, such as creating a total_sales calculation. All derived values must have an explicit name and are then reusable further on in the query. Checking the SQL here will show you that you've now moved onto referencing 3 tables!
select 
    store_sales.customer.id,
    store_sales.customer.first_name,
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales
order by
    total_sales desc
limit 10;

The Model

We mentioned earlier that we had a model defined that was lettings us run these queries. Let's peek below the hood. Modeling is the heart and "art" of Trilogy; defining names and relationships that enable intuitive access to data.

Models will define some concepts, bind some concepts, and often - import other models. While models can be 1-1 to tables and often are, they don't need to be - and this can be useful when refactoring, extending, or simplifying models.

TIP

A model describes the relationships between Trilogy concepts and the underlying data. It is a contract that can be used to generate queries.

TIP

A model definition is written in the same language as these queries - the access language and the modeling language are identical in Trilogy. You can define and extend a model as part of an adhoc exploratory workload, just as you can with SQL.

This definition is for the store_sales references we've been querying above. The full tpc-ds model contains multiple files like this.


import item as item;
import date as date;
import date as return_date;
import time as time;
import time as return_time;
import customer as customer;
import customer as return_customer;
import promotion as promotion;
import customer_demographic as customer_demographic;
import store as store;
import store as return_store;

# you can interleave modeling and queries;
select 1 -> example;


key ticket_number int;
property <ticket_number,item.id>.quantity int;
property <ticket_number,item.id>.sales_price float;
property <ticket_number,item.id>.list_price float;
property <ticket_number,item.id>.ext_sales_price float;
property <ticket_number,item.id>.ext_wholesale_cost float;
property <ticket_number,item.id>.ext_list_price float;
property <ticket_number,item.id>.ext_discount_amount float;
property <ticket_number,item.id>.coupon_amt float;
property <ticket_number,item.id>.net_profit float;
property <ticket_number,item.id>.is_returned bool;
property <ticket_number,item.id>.net_paid float;
property <ticket_number,item.id>.return_amount float?;
property <ticket_number,item.id>.return_net_loss float?;

# and derive calculated metrics
auto profit <- ext_list_price - ext_wholesale_cost - ext_discount_amount + ext_sales_price;

# bind defined values to a table 
datasource store_sales (
    SS_SOLD_DATE_SK: date.id,
    SS_SOLD_TIME_SK: time.id,
    SS_CUSTOMER_SK: customer.id,
    SS_CDEMO_SK: customer_demographic.id,
    SS_TICKET_NUMBER: ticket_number,
    SS_ITEM_SK: item.id,
    SS_SALES_PRICE: sales_price,
    SS_LIST_PRICE: list_price,
    SS_EXT_SALES_PRICE: ext_sales_price,
    SS_EXT_LIST_PRICE: ext_list_price,
    SS_EXT_WHOLESALE_COST: ext_wholesale_cost,
    SS_EXT_DISCOUNT_AMT: ext_discount_amount,
    SS_NET_PROFIT: net_profit,
    SS_PROMO_SK: promotion.id,
    SS_QUANTITY: quantity,
    SS_COUPON_AMT: coupon_amt,
    SS_STORE_SK: store.id,
    SS_NET_PAID: net_paid,
)
# the grain is used to resolve queries appropriately
grain (ticket_number, item.id)
address memory.store_sales
;

# multiple tables can be used to source metrics
datasource store_returns(
    SR_RETURNED_DATE_SK: return_date.id,
    SR_RETURN_TIME_SK: return_time.id,
    SR_ITEM_SK: ~item.id,
    SR_CUSTOMER_SK: return_customer.id,
    SR_RETURN_AMT: return_amount,
    SR_TICKET_NUMBER: ~ticket_number,
    SR_STORE_SK: return_store.id,
    SR_NET_LOSS: return_net_loss,
    # capture if there is a row in this table
    bool(return_time.id): is_returned,
)
grain (ticket_number, item.id)
address memory.store_returns;

If you look at the top, there are imports. These are the key to reuse in Trilogy; they enable composition of models.

Returns by Different State

The same model can be bound in different contexts, like the customer model being reused across two customer sources. Checking SQL here will show four different tables, two of which are reused with different foreign key relations. You're picking up this data model quickly - nice work!
where store_sales.return_customer.state is not null 
    and store_sales.return_customer.state != store_sales.customer.state
select
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    store_sales.return_customer.state,
order by
    total_sales desc
limit 10;

In fact, if we drill into customers, we'll see that model itself has an import. That entire extended model has been imported twice; once for the customer who bought, and once for the one that returned, enabling the same logic to be reused across both keys easily.

Beyond reuse, we want abstraction - the key to that is that datasources - which make the models concrete - can be evolved and split transparently to a consuming query. We can even use this to dynamically swap in aggregates when we compute a higher level table with the same outputs. (such as if a session runs a persist to store a query output).

import customer_demographic as demographics;

key id int;
property id.text_id string;
property id.last_name string;
property id.first_name string;
property id.preferred_cust_flag string;
property id.birth_day int;
property id.birth_month int;
property id.birth_year int;
property id.birth_country string;
property id.salutation string;
property id.email_address string;
property id.login string;
property id.last_review_date string;

# avoid having to recalculate this in queries
property id.full_name <- concat(salutation, ' ', first_name, ' ', last_name);

property id.birth_date <- cast(
    concat(cast(birth_year as string), 
        '/', cast(birth_month as string), 
        '/', cast(birth_day as string)
    ) as date
);


key address_id int;
property address_id.address_id_string string;
property address_id.street string;
property address_id.city string;
property address_id.state string?;
property address_id.zip string;
property address_id.county string;
property address_id.country string;

datasource customers (
    C_CUSTOMER_SK: id,
    C_CUSTOMER_ID: text_id,
    C_LAST_NAME: last_name,
    C_FIRST_NAME: first_name,
    C_CURRENT_ADDR_SK: address_id,
    C_CURRENT_CDEMO_SK: demographics.id,
    C_PREFERRED_CUST_FLAG: preferred_cust_flag,
    C_BIRTH_COUNTRY: birth_country,
    C_SALUTATION: salutation,
    C_EMAIL_ADDRESS: email_address,
    C_BIRTH_DAY: birth_day,
    C_BIRTH_MONTH: birth_month,
    C_BIRTH_YEAR: birth_year,
    C_LOGIN: login,
    C_LAST_REVIEW_DATE_SK:last_review_date,
)
grain (id)
address memory.customer;


datasource customer_address(
    CA_ADDRESS_SK: address_id,
    CA_STREET_NAME: street,
    CA_CITY: city,
    CA_STATE: state, # Two character state code; eg. CA for California, MA for Massachusetts
    CA_ZIP: zip,
    CA_COUNTY: county,
    CA_COUNTRY:country,
)
grain (address_id)
address memory.customer_address;

Derived Concepts and Filtering

Now that we understand the basics of selection, let's get into the world of derivation, multi-level aggregation, and filtering.

We're going to stick with store_sales data for now.

Let's run through some examples

State Aggregation

As an example, if we move our aggregation up to the state level, we can reuse or total_sales and a new customer_count to calculate average sales
select 
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    count(store_sales.customer.id) as customer_count,
    total_sales / customer_count as average_sales_per_customer
order by
    average_sales_per_customer desc
limit 10; 

Mixed aggregate

Often, we need to compare concepts at different level of aggregation. Our queries so far have not specified a level, which means they are implicitly grouped by all dimensions in the query. We can combine different levels by explicitly aggregating "by" a concept. Tip: we've commented out some output fields with -- to streamline the result set; these are included in the projection but not returned in final query.
select 
    --store_sales.customer.id,
    store_sales.customer.first_name,
    --store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    --sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
    total_sales / total_state_sales as fraction_of_total_state_sales
order by
    fraction_of_total_state_sales desc
limit 10;

Filtering

The where clause comes first in idiomatic Trilogy, reducing the size of the query space before other operations are applied. (It'll accept a where clause after the select, if you prefer that). A where condition applies to the entire query.
WHERE store_sales.customer_demographic.education_status = 'College'
SELECT
    --store_sales.customer.id,
    store_sales.customer.first_name,
    store_sales.customer.state,
    sum(store_sales.sales_price) as total_sales,
    --sum(store_sales.sales_price) by store_sales.customer.state as total_state_sales,
    total_sales / total_state_sales as fraction_of_total_state_sales
order by
    store_sales.customer.state asc,
    fraction_of_total_state_sales desc
limit 10;
    

Nuanced Filter

When that's too blunt an instrument, selection can be further refined in two ways. A filtered concept can be created by using a where clause - "filter store_sales.sales_price where store_sales customer_demographic.education_status = 'College'", for example - and the output of a selection can be filtereing via the HAVING clause (just like SQL). So to see the % of all state sales that is coming from college educated people, we can leave the global space unfiltered, filter sales inline, and then further refine our output rows. Tip: the "?" syntax, eg store_sales.sales_price ? store_sales.customer_demographic.education_status = 'College', can be a more concise inline filtering if you're okay with shorthand!
SELECT
    store_sales.customer.state,
    store_sales.customer_demographic.education_status,
    sum(filter store_sales.sales_price where store_sales.customer_demographic.education_status = 'College') as college_sales,
    sum(store_sales.sales_price ? store_sales.customer_demographic.education_status = 'College') as college_sales_alt,
    --sum(store_sales.sales_price ) by store_sales.customer.state as total_state_sales,
    college_sales / total_state_sales as fraction_of_total_state_sales
HAVING
    store_sales.customer_demographic.education_status = 'College'
order by
    store_sales.customer.state asc,
    fraction_of_total_state_sales desc
limit 10;
    

Find the average per-state rank by total store sales of items

More complicated derivations can be done by creating a virtual collection of rows [like a CTE in SQL], called a rowset. A rowset is defined by prefixing "with <name> as" or "rowset <name> <-" before a normal select query and can then be queried directly to access the outputs under the alias of the rowset name. Rowsets can be thought of as reusable CTEs, though there are some differences - defined properties that are not used may not ever be materialized, and rowsets can be used in more places than CTEs. You shouldn't need to use rowsets as often as you need to use CTEs, but they're an option.

with ranked_states as
select 
    store_sales.item.name,
    store_sales.customer.state,
    rank store_sales.item.name
        over store_sales.customer.state 
        order by sum(store_sales.sales_price) by store_sales.item.name, store_sales.customer.state  desc 
    as sales_rank;

select 
    ranked_states.store_sales.item.name,
    avg(cast(ranked_states.sales_rank as int))-> avg_sales_rank,
    max(cast(ranked_states.sales_rank as int))-> max_sales_rank,
    min(cast(ranked_states.sales_rank as int))-> min_sales_rank
order by 
    avg_sales_rank desc
limit 10
;

Sandbox

Ready? Let's have you try out some queries!

TIP

For basic queries, Trilogy should be almost identical to SQL. When in doubt, try the SQL syntax!

Can you answer these questions? (click to show a possible answer)

  • What were the sales by year in the state of CA? >
  • What was the average yearly sales for the state of CA? >
  • Which customer demographics had the most store sales in 2001, and what were the sales within that demographic in hawaii? >

The following concepts are predefined for you and can be referenced by name.

There's more concepts that we can reasonably show you available. Search this box for inspiration:

WARNING

Stick to querying concepts with the same root for now - eg store_sales.x and store_sales.y. Or if you're feeling adventerous, read on to find out how to merge models to enable cross-namespace querying.

    Multiple Models

    We've had some fun with store sales, but what about the rest of the dataset?

    A typical Trilogy script will be based on importing existing models - and these models may themselves have imports. But if you have two models you want to connect, how can you do that?

    There's two tools. Let's start with the one you generally don't want to use - a query merge.

    Query Merges

    Query merges look like selects, but with more than 1 select clause + an align statement that defines how to merge the selects. Think of this as a full outer join between the listed columns to merge the select outputs. The where, order by, and others are identical.

    This can be useful to combine disparate concepts into a single result set row, such as show orders by ship date and delivery date in the same row.

    TIP

    Only use query merges when you are aligning conceptually unrelated topics into a tabular dataset.

    SELECT 1 
    MERGE
    SELECT 2
    ALIGN <align_alias>:<select 1 concept>,<select 2 concept>
    AND 
    <align_alias_2>:<select 1 concept>,<select 2 concept>
    

    Model Merges

    The preferred method is to merge the model on one or more concepts. This tells Trilogy that these two fields are the "same". For example, if you have a sales dataset and a holidays dataset, they might have the following fields:

    Sales: 'order_date', 'ship_date', 'returned_date', 'order_id'

    Holidays: 'date', '<>.holiday_name'

    You want to know what sales were ordered on holidays - you would merge the holidays date into the sales.order_date, and you could now easily query select holiday_name, count(order_id).

    If you wanted to see orders that shipped on a holiday, you'd merge them on ship date - and if you wanted to be able to query both, you could import the holidays dataset under two differet names and merge them independently.

    Merge on one: MERGE <concept1> into <modifiers?><concept2>

    Merge on many: MERGE <namespace1>.* into <modifiers?><namespace2>.*

    Let's try both out:

    Query Merge

    A MERGE clause can be used to combine multiple select statements. When present, there must be an align clause that contains one or more named 'alignment' concepts from each sub select. The output of each select will be merged together based on these keys, and the coalesced value will be in the new name column. This can be used both to merge models and to combine results of multiple queries in the same model, such as showing product orders for a month by both ship date and order date in the same row.
    SELECT
        store_sales.date.year,
        count(store_sales.ticket_number) as store_order_count
    HAVING
        store_order_count>0
    MERGE
    SELECT
        web_sales.date.year,
        count(web_sales.order_number) as web_order_count
    HAVING
        web_order_count>0 
    ALIGN 
        report_date: store_sales.date.year, web_sales.date.year
    ORDER BY
        report_date asc;
    

    Model Merge

    A standalone MERGE clause can be used to combine two (or all) concepts across two models, enabling them to be queried seamlessly. The syntax is merge <left> into <modifiers?><right>, where the left and right are the concepts to be merged. The ~ symbol in this example is an optional modifier to indicate that the values being merged in might not be complete - for example, not every date might be in store_sales.date - and so querying for the root concept should left outer join across to the other models.
    MERGE store_sales.date.* into ~date.*;
    MERGE web_sales.date.* into ~date.*;
    
    SELECT
        date.year,
        count(web_sales.order_number) as web_order_count,
        count(store_sales.ticket_number) as store_order_count
    HAVING
        web_order_count>0 or store_order_count>0
    ORDER BY 
        date.year asc
    LIMIT 100;
    

    TIP

    Multiple merge statements can be defined between two models; queries will merge across as many concepts as are referenced in the query.

    Saving Results / ETL

    Imagine you want to create tables or save the outputs of a query to power a dashboard.

    Trilogy supports this through the persist keyword.

    This keyword is used to signify that any table created by Trilogy is effectively a cache of a given output. You've already defined canonical sources; if we copy that data into a new table it's only valid until the sources change.

    In base Trilogy, this can be used to create or update a table, such as on powering a dashboard.

    TIP

    In practice, most warehouse will have a finite set of 'roots' and a number of caches derived from them that are refreshed on some cadence to drive reporting, analytics, and performance. Trilogy allows you to explicitly define and manage those caches via persist statements.

    Basic Persist

    The first identifier after persist is the trilogy name of the source; the second is the resource name used in the backing system. (DuckDB in this case)
    
    MERGE store_sales.customer.* into customer.*;
    MERGE web_sales.customer.* into customer.*;
    MERGE store_sales.date.* into store_sales.date.*;
    MERGE web_sales.date.* into web_sales.date.*;
    
    PERSIST customer_fact into customer_fact from
    WHERE date.year = 2000
    AND customer.state = 'VT'
    SELECT
        customer.id,
        date.year,
        sum(store_sales.sales_price) as total_store_sales,
        sum(web_sales.sales_price) as total_web_sales,
        total_store_sales + total_web_sales as total_sales
    
    

    TrilogyT

    Interested in using Trilogy for data transformation as well as data access?

    TrilogyTopen in new window is a Trilogy integration that adds additional keywords to support ETL workflows and integrates closely with DBT to support a full data warehouse workflow.

    You can read more in the TrilogyT docs.

    Benefits of A Semantic Layer

    A semantic layer is a business representation of data and offers a unified and consolidated view of data across an organization.

    Trilogy is a semantic layer - the abstract expression of relationships between concepts enables a consistent and coherent view of the data, and the separation of the model from the underlying data enables the model to evolve independently of the underlying data.

    This makes Trilogy a natural fit for any case where you want to operate at this higher level of abstraction - a popular one these days being GenAI. It's much easier for a GenAI model to identify the appropriate concepts for a query and let Trilogy map that to the underlying data model; and as with a human written query this makes the result robust to changes/evolution of your data model.

    Trilogy-NLP / GenAI

    Trilogy-NLPopen in new window is a Trilogy integration that adds natural language processing to Trilogy, enabling users to write queries in natural language and have them translated to SQL via the Trilogy semantic model. The simplified abstraction of a Trilogy model is natural fit for generative AI models, bypassing potential hallucination issues and SQL correctness challenges by focusing the model solely on interpreting the user query and mapping that to the higher level semantic layer.

    WARNING

    Trilogy-NLP is in active development and may not always return correct results. As with other generative AI tools, experimenting with what prompts produce the best results is recommended.

    Trilogy-NLP can either be used in an interactive analytics context, as a helper for users to write queries, or as a backend for a chatbot or other natural language interface to a database.

    You can experiment with querying the TPC-ds dataset with trilogy-NLP below.

    TIP

    You can view the generated query just like with Trilogy - Trilogy-NLP is just a different way to write the same queries, and it will always show it's work.

      Want to Learn More?

      For a deeper dive into the language and philosphy, head over to concepts.

      For an assortment of related reading, head over to the blog page.

      For the background on why Trilogy exists, head over to the thesis page.

      Can't get enough of TPC-DS? Head over to the tpc-ds blog

      For more details on a product, go to the product pages: