Demo - Exploring the Titanic

This Titanic demo is a simple example of how Trilogy syntax works. It uses the well-known Titanic dataset, a single table with the following fields about passengers:

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

Our First Queries

TIP

This demo uses an in-memory DuckDB database that may take some time to start up before your first query. Trilogy is database agnostic; the same syntax will work on any backend, such as Postgres, BigQuery, or Snowflake.

Basic Select

select passenger.name, passenger.id limit 5;

How many survived?

select 
    passenger.survived, 
    passenger.id.count, 
    count(passenger.id) as passenger_count_alt, 
    count(passenger.id) -> passenger_count_alt_2
;

The Model

Those queries could run because we have a model defined already; we'll dig into this more later. For now, just note that each concept is declared with a type, we have one derived metric, and there is a datasource defined that maps these concepts to a table.

TIP

A model describes the relationships between Trilogy concepts and the underlying data. It is a contract that can be used to generate queries.


key passenger.id int; #numeric identifier created for dataset
property passenger.id.age int?; # can be null, age of passenger
property passenger.id.survived int; # 1 = survived, 0 = died
property passenger.id.name string; # full name
property passenger.id.class int; # class of passenger, 1-3
property passenger.id.fare float; # the price paid by passenger
property passenger.id.cabin string?; # can be null, the cabin the passenger was in
property passenger.id.embarked bool;
property passenger.id.last_name <- split(passenger.name,',')[1];

metric passenger.id.count <- count(passenger.id);

datasource raw_data (
    passengerid:passenger.id,
    age:passenger.age,
    survived:passenger.survived,
    pclass:passenger.class,
    name:passenger.name,
    fare:passenger.fare,
    cabin:passenger.cabin,
    embarked:passenger.embarked,
)
grain (passenger.id)
address raw_titanic;

Derived Concepts and Filtering

TIP

last_name is defined in the model by splitting a passenger's name on the comma and taking the first element (names are formatted 'Last, Title. First'). We'll use this derived last_name as a rough family identifier in these queries; if you explore in the sandbox you'll find that's not quite right, but it's good enough for now.
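Outside of Trilogy, the same derivation can be sketched in plain Python (note Python lists are 0-indexed; the example name is illustrative, not taken from the demo data):

```python
# Sketch of the model's last_name derivation: split on the comma and
# keep the part before it, which is the family name in "Last, Title. First".
def last_name(name: str) -> str:
    return name.split(",")[0].strip()

print(last_name("Braund, Mr. Owen Harris"))  # Braund
```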

Let's run through some examples.

Family Sizing (By Last Name)

select 
    passenger.last_name, 
    count(passenger.id) as family_size
order by 
    family_size desc
limit 5;

How big were families in each class?

select 
    passenger.class,
    max(
        count(passenger.id) by passenger.last_name
    ) by passenger.class
     as max_family_size_per_class,
     avg(
        count(passenger.id) by passenger.last_name
    ) by passenger.class
    -> avg_family_size_per_class, 
    
order by 
    passenger.class asc
;

How many survived from each family (by last name)?

auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
select 
    passenger.last_name,
    passenger.id.count,
    count(surviving_passenger) as surviving_size
order by
    passenger.id.count desc
limit 5;
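The filter-then-count pattern above (total passengers per family alongside only the surviving ones) can be sketched in ordinary Python; the rows here are made up for illustration, not the real demo data:

```python
from collections import Counter

# Hypothetical (last_name, survived) rows standing in for the demo table.
rows = [
    ("Andersson", 0), ("Andersson", 1), ("Andersson", 0),
    ("Carter", 1), ("Carter", 1),
]

# count(passenger.id) per last_name: every row counts.
family_size = Counter(last for last, _ in rows)
# count(surviving_passenger): only rows passing the filter count.
surviving_size = Counter(last for last, survived in rows if survived == 1)

print(family_size["Andersson"], surviving_size["Andersson"])  # 3 1
```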

Families (by last name) where everyone survived

auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
select 
    passenger.last_name,
    passenger.id.count,
    count(surviving_passenger) as surviving_size
where
    passenger.id.count = surviving_size
order by
    passenger.id.count desc
limit 5;

Get the first name, last name, and age of the oldest survivor in each family

with survivors as
SELECT
    passenger.last_name, 
    passenger.name,
    passenger.id, 
    passenger.survived,
    passenger.age,  
where 
    passenger.survived = 1;

# now we can reference our rowset derived concepts like any other concept
select 
    --survivors.passenger.id,
    survivors.passenger.name,
    survivors.passenger.last_name,
    survivors.passenger.age,
    --row_number survivors.passenger.id over 
        survivors.passenger.last_name
        order by survivors.passenger.age desc 
    as eldest
where 
    eldest = 1
order by survivors.passenger.last_name desc
limit 5;
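The windowed row_number logic above (keep the row ranked first per family when ordered by age descending) can be sketched in Python over hypothetical survivor rows:

```python
from itertools import groupby

# Hypothetical (last_name, name, age) survivor rows, not the real data.
survivors = [
    ("Carter", "Carter, Mrs. Lucile", 36.0),
    ("Carter", "Carter, Miss. Lucile P", 14.0),
    ("Becker", "Becker, Mrs. Nellie", 36.0),
]

# Sort by family, then age descending, and take the first row per family:
# the equivalent of row_number() = 1 with the same partition and ordering.
survivors.sort(key=lambda r: (r[0], -r[2]))
eldest = {last: next(rows) for last, rows in groupby(survivors, key=lambda r: r[0])}

print(eldest["Carter"][1])  # Carter, Mrs. Lucile
```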

Sandbox

Now, try writing your own queries in the sandbox below.

The query box is stateful; a successful query (such as adding a concept) will be saved and can be referenced in future queries.

TIP

For basic queries, Trilogy should be almost identical to SQL. When in doubt, try the SQL syntax!

Can you answer these questions?

  • Did different classes have different average fares?
  • Were people in higher classes more likely to survive?
  • Were certain ages more likely to survive?
  • What was the average family (assume one last name is one family) survival rate in each class?
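The shape of one of these answers (survival rate by class) can be sketched outside Trilogy with made-up rows; only the aggregation logic, not the data, carries over:

```python
from collections import defaultdict

# Hypothetical (pclass, survived) rows, not the real dataset.
rows = [(1, 1), (1, 1), (1, 0), (3, 0), (3, 0), (3, 1)]

totals = defaultdict(lambda: [0, 0])  # class -> [survivors, passengers]
for pclass, survived in rows:
    totals[pclass][0] += survived
    totals[pclass][1] += 1

# survival rate = survivors / passengers, per class
rates = {c: s / n for c, (s, n) in totals.items()}
print(round(rates[1], 2))  # 0.67
```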

The following concepts are predefined for you and can be referenced by name.

Available Concepts

passenger.id, passenger.age, passenger.survived, passenger.name, passenger.class, passenger.fare, passenger.cabin, passenger.embarked

    Multiple Tables

    You've been able to do some initial analysis on the Titanic dataset, but now your data engineer has gotten excited about someone named Kimball. They've refactored your dataset to normalize it, and now you have the following tables.

    • fact_titanic
    • dim_class
    • dim_passenger

    Let's see how we can use Trilogy to query this new dataset.

    Basic Select

    select passenger.name, passenger.id limit 5;

    How many survived?

    select 
        passenger.survived, 
        passenger.id.count, 
        count(passenger.id) as passenger_count_alt, 
        count(passenger.id) -> passenger_count_alt_2
    ;

    Family Sizing (By Last Name)

    select 
        passenger.last_name, 
        count(passenger.id) as family_size
    order by 
        family_size desc
    limit 5;

    How big were families in each class?

    select 
        passenger.class,
        max(
            count(passenger.id) by passenger.last_name
        ) by passenger.class
         as max_family_size_per_class,
         avg(
            count(passenger.id) by passenger.last_name
        ) by passenger.class
        -> avg_family_size_per_class, 
        
    order by 
        passenger.class asc
    ;

    How many survived from each family (by last name)?

    auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
    select 
        passenger.last_name,
        passenger.id.count,
        count(surviving_passenger) as surviving_size
    order by
        passenger.id.count desc
    limit 5;

    Families (by last name) where everyone survived

    auto surviving_passenger <- filter passenger.id where passenger.survived = 1;
    select 
        passenger.last_name,
        passenger.id.count,
        count(surviving_passenger) as surviving_size
    where
        passenger.id.count = surviving_size
    order by
        passenger.id.count desc
    limit 5;

    Get the first name, last name, and age of the oldest survivor in each family

    with survivors as
    SELECT
        passenger.last_name, 
        passenger.name,
        passenger.id, 
        passenger.survived,
        passenger.age,  
    where 
        passenger.survived = 1;
    
    # now we can reference our rowset derived concepts like any other concept
    select 
        --survivors.passenger.id,
        survivors.passenger.name,
        survivors.passenger.last_name,
        survivors.passenger.age,
        --row_number survivors.passenger.id over 
            survivors.passenger.last_name
            order by survivors.passenger.age desc 
        as eldest
    where 
        eldest = 1
    order by survivors.passenger.last_name desc
    limit 5;

    This should look pretty familiar. What's going on? Take a look at the queries being generated by clicking the 'sql' tab to confirm that something has changed.

    The Model

    Let's look at our new model.

    key passenger.id int;
    key _class_id int;
    property passenger.id.age int?;
    property passenger.id.survived bool;
    property passenger.id.name string;
    property passenger.id.fare float;
    property passenger.id.cabin string?;
    property passenger.id.embarked int;
    property passenger.last_name <- index_access(split(passenger.name,','),1);
    
    datasource dim_passenger (
        id:passenger.id,
    	age:passenger.age,
    	name:passenger.name,
    	last_name:passenger.last_name
        )
    grain (passenger.id)
    address dim_passenger;
    
    datasource fact_titanic (
        passengerid:passenger.id,
    	survived:passenger.survived,
    	class_id:_class_id,
    	fare:passenger.fare,
    	cabin:passenger.cabin,
    	embarked:passenger.embarked
        )
    grain (passenger.id)
    address fact_titanic;
    
    datasource dim_class (
        id:_class_id,
    	class:passenger.class
        )
    grain (_class_id)
    address dim_class;
    
    
    

    Note that our concepts are almost unchanged. We've added a new _class_id concept to capture the new surrogate key added to dim_class.

    We've also changed the datasource definitions to point to the new tables.

    But as a consumer, you haven't needed to change a thing. We are still encoding the same logical concepts, and the semantic relationships between them have not changed.

    This is the convenience of separating the query language from the data model. The data model expresses a contract that can be evolved independently of the underlying materialized database tables, enabling transparent refactoring, aggregation, and remodeling to reflect the changing needs of the business.

    Multi-Table Sandbox

    Try querying the new model in the sandbox below.

      Extending the Model

      A typical Trilogy script will be based on importing existing models.

      import passengers as passenger;
      

      Often, those models may themselves contain imports, which together can create a complex model.

      A passenger model may import a family model, for example, enabling nested references like the one below.

      import passengers as passenger;
      
      select passenger.family.last_name, passenger.family.size;
      
      

      For our titanic model, imagine if someone else had set up a reference model of the richest families in the world.

      We want to merge this family information in with our titanic model.

      There are two approaches we can use; one is to merge the results of two models within a single query, and one is to merge the two models on a shared concept so that we can build many dynamic queries against them.

      Query Merges

      Query merges look like selects, but with more than one select clause plus an align statement that defines how to merge the selects. Think of this as a full outer join between the listed columns to merge the select outputs. The where, order by, and other clauses are identical.

      This can be useful for combining disparate concepts into a single result set row, such as showing orders by ship date and delivery date in the same row.

      SELECT 1 
      MERGE
      SELECT 2
      ALIGN <align_alias>:<select 1 concept>,<select 2 concept>
      AND 
      <align_alias_2>:<select 1 concept>,<select 2 concept>
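
The alignment behaves like a full outer join on the aligned columns; here is a minimal Python sketch of that semantics, using made-up result sets keyed by last name:

```python
# Hypothetical outputs of two selects, keyed by the aligned concept.
select_1 = {"Astor": 2, "Brown": 4}                     # family_count
select_2 = {"Astor": 87_000_000, "Gould": 50_000_000}   # net worth

# ALIGN merges on the shared key like a full outer join: every key from
# either side appears once, with None where one side has no match.
aligned = {
    key: (select_1.get(key), select_2.get(key))
    for key in sorted(set(select_1) | set(select_2))
}

print(aligned["Gould"])  # (None, 50000000)
```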
      

      Model Merges

      Model merges are more straightforward syntactically. Conceptually, they represent that the two concepts are synonymous but may not share all the same values; they can be used to bridge concept references across two models.

      MERGE <concept1> into <modifiers?><concept2>

      Let's try both out:

      Query Merge

      # this is already imported, commented for clarity
      # import rich_families as rich_info; 
      
      SELECT
          passenger.last_name,
          count(passenger.id) -> family_count
      MERGE
      SELECT
          rich_info.last_name,
          rich_info.net_worth_1918_dollars
      ALIGN 
          join_last_name:passenger.last_name, rich_info.last_name
      WHERE 
          rich_info.net_worth_1918_dollars is not null
          AND passenger.last_name is not null
      ORDER BY
          family_count desc;

      Model Merge

      # this is already imported, commented for clarity
      # import rich_families as rich_info; 
      MERGE rich_info.last_name into ~passenger.last_name;
      
      SELECT
          passenger.last_name,
          count(passenger.id)-> family_count,
          rich_info.net_worth_1918_dollars,
      WHERE
          rich_info.net_worth_1918_dollars is not null
          and passenger.last_name is not null
      ORDER BY 
          family_count desc ;

      TIP

      Multiple merge statements can be defined between two models; queries will merge across as many concepts as are referenced in the query.

      Saving Results / ETL

      Imagine you want to create tables or save the outputs of a query to power a dashboard.

      Trilogy supports this through the persist keyword.

      This keyword is used to signify that any table created by Trilogy is effectively a cache of a given output. You've already defined canonical sources; if we copy that data into a new table it's only valid until the sources change.

      In base Trilogy, this can be used to create or update a table, such as one powering a dashboard.

      TIP

      In practice, most warehouses will have a finite set of 'roots' and a number of caches derived from them, refreshed on some cadence to drive reporting, analytics, and performance. Trilogy enables you to explicitly define and manage those caches via persist statements.

      The first query here shows a persist command; the second shows how the generated query will reference the persisted value.

      Basic Persist

      
      property passenger.id.split_cabin <- unnest(split(passenger.cabin, ' '));
      persist cabin_info into dim_cabins from 
      select 
          passenger.id, 
          passenger.split_cabin;
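
The unnest(split(...)) derivation turns one multi-cabin string into one row per cabin code; a Python sketch of that flattening, with hypothetical cabin values:

```python
# Hypothetical (passenger_id, cabin) rows; a cabin field may hold
# several space-separated codes.
rows = [(1, "C23 C25 C27"), (2, "E46")]

# split on spaces, then unnest: one output row per individual cabin code.
unnested = [(pid, code) for pid, cabin in rows for code in cabin.split(" ")]

print(unnested)  # [(1, 'C23'), (1, 'C25'), (1, 'C27'), (2, 'E46')]
```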

      Query Our Persisted Table

      select passenger.id, passenger.split_cabin limit 20;

      TrilogyT

      TrilogyT is a Trilogy integration that adds additional keywords to support ETL workflows and integrates closely with dbt to support a full data warehouse workflow.

      You can try it in the TrilogyT demo or read more about it in the docs.

      Benefits of A Semantic Layer

      A semantic layer is a business representation of data and offers a unified and consolidated view of data across an organization.

      Trilogy is a semantic layer - the abstract expression of relationships between concepts enables a consistent and coherent view of the data, and the separation of the model from the underlying data enables the model to evolve independently of the underlying data.

      This makes Trilogy a natural fit for any case where you want to operate at this higher level of abstraction - a popular one these days being GenAI. It's much easier for a GenAI model to identify the appropriate concepts for a query and let Trilogy map them to the underlying data model; as with a human-written query, this makes the result robust to evolution of your data model.

      Trilogy-NLP / GenAI

      Trilogy-NLP is a Trilogy integration that adds natural language processing, enabling users to write queries in natural language and have them translated to SQL via the Trilogy semantic model. The simplified abstraction of a Trilogy model is a natural fit for generative AI models, bypassing potential hallucination issues and SQL correctness challenges by focusing the model solely on interpreting the user query and mapping it to the higher-level semantic layer.

      WARNING

      Trilogy-NLP is in active development and may not always return correct results. As with other generative AI tools, experimenting with what prompts produce the best results is recommended.

      Trilogy-NLP can either be used in an interactive analytics context, as a helper for users to write queries, or as a backend for a chatbot or other natural language interface to a database.

      You can experiment with querying the titanic dataset we've defined using Trilogy-NLP below.

      TIP

      You can view the generated query just like with Trilogy - Trilogy-NLP is just a different way to write the same queries, and it will always show its work.

        TPC-DS Comparisons

        TPC-DS is a common database benchmarking tool that models a retail data warehouse. Ensuring that we can accurately replicate queries against the TPC-DS schema is a good test of the expressive power of Trilogy. A selected comparison of where Trilogy and SQL differ is below.

        Typically, Trilogy-generated SQL will perform worse than hand-optimized SQL. Trilogy optimizes for speed of authoring, expressiveness, and abstraction first, and performance second, though performance must remain reasonable enough not to be a blocker.

        WARNING

        This section is a work in progress and will be expanded over time.

        Query 3 - Brand Sales

        This is a fairly straightforward query in both languages.

        SQL

        SELECT dt.d_year,
               item.i_brand_id brand_id,
               item.i_brand brand,
               sum(ss_ext_sales_price) sum_agg
        FROM date_dim dt,
             store_sales,
             item
        WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
          AND store_sales.ss_item_sk = item.i_item_sk
          AND item.i_manufact_id = 128
          AND dt.d_moy=11
        GROUP BY dt.d_year,
                 item.i_brand,
                 item.i_brand_id
        ORDER BY dt.d_year,
                 sum_agg DESC,
                 brand_id
        LIMIT 100;

        Trilogy

        Here we use an inline filter definition to push down the sum filtering, since there is only a single concept.

        import store_sales as store_sales;
        SELECT
            store_sales.date.year,
            store_sales.item.brand_id,
            store_sales.item.brand_name,
            sum(filter store_sales.ext_sales_price 
                    where store_sales.date.month_of_year=11 
                    and store_sales.item.manufacturer_id=128
                )->sum_agg
        WHERE 
            sum_agg
        ORDER BY 
            store_sales.date.year asc,
            sum_agg desc,
            store_sales.item.brand_id asc
        limit 100;

        Query 1 - Customer Returns

        This is a more complicated query - both use a CTE to calculate a set of values, then refine it.

        SQL

        WITH customer_total_return AS
          (SELECT sr_customer_sk AS ctr_customer_sk,
                  sr_store_sk AS ctr_store_sk,
                  sum(sr_return_amt) AS ctr_total_return
           FROM store_returns,
                date_dim
           WHERE sr_returned_date_sk = d_date_sk
             AND d_year = 2000
           GROUP BY sr_customer_sk,
                    sr_store_sk)
        SELECT c_customer_id
        FROM customer_total_return ctr1,
             store,
             customer
        WHERE ctr1.ctr_total_return >
            (SELECT avg(ctr_total_return)*1.2
             FROM customer_total_return ctr2
             WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
          AND s_store_sk = ctr1.ctr_store_sk
          AND s_state = 'TN'
          AND ctr1.ctr_customer_sk = c_customer_sk
        ORDER BY c_customer_id
        LIMIT 100;

        Trilogy

        import store_returns as returns;
        
        with tn_2000 AS select
            returns.customer.text_id,
            returns.store.id,
            sum(returns.return_amount)-> total_returns
        where
            returns.return_date.year = 2000
            and returns.store.state = 'TN';
        
        auto avg_store_returns <- avg(tn_2000.total_returns) by tn_2000.returns.store.id;
        
        select
            tn_2000.returns.customer.text_id,
            --tn_2000.total_returns, 
            --avg_store_returns,
        where
            tn_2000.total_returns > 
            (1.2*avg_store_returns)
        order by 
            tn_2000.returns.customer.text_id asc
        limit 100;