Overview

SELECT without the FROM, JOIN, or GROUP BY

Trilogy is a typed language for declarative data access that compiles to SQL. Users define and query against a semantic layer directly, instead of tables - but in familiar SQL syntax.

Trilogy compiles their query to SQL against the right tables, with the right joins. Use Trilogy directly or generate SQL to be used in other tools.

# just select what you want
SELECT
    product_line,
    revenue,
    revenue_from_top_customers,
    revenue_from_top_customers / revenue as revenue_from_top_customers_pct
WHERE
    product_line in ('widgets', 'doodads')

Why is this useful?

  • For the Analyst >
  • For the Engineer >
  • For the Data Engineer >
  • For Everyone >
  • Syntax

    A basic Trilogy statement is one or more lines ending a semicolon. SQL attempts to follow SQL syntax as closely as possible for overlapping areas, while removing some features [such as joins and the FROM clause] that are handled by the semantic layer.

    The most common statement is a select, which will start with select and then have one or more concepts or constant references to output. Each item in the select must be mapped to an output name via the arrow assignment -> operator or an 'as' (they are equivalent). This example is equivalent to "select 1 as constant_one, 1 as constant_one_2, '2' as constant_string_two" in most databases.

    select
        1 -> constant_one,
        1 as constant_one_2,
        '2'-> constant_string_two
    ;
    

    TIP

    The <- or -> assignment operator is used throughout Trilogy, but where there is an equivalent default in standard SQL that will also be supported, such as 'as' for select.

    Concepts

    Trilogy queries are built on concepts, which can be constants - such as integers or strings, shown in the first query - or abstract semantic refernces that an be bound to one or more tables. Concepts are the core semantic building block in Trilogy. Trilogy concepts are either keys/properties - values such as a name, ID, or price - or metrics, which are aggregatable to different levels. The cost of a banana is a property; the total cost of groceries is a metric.

    TIP

    Concepts represent unique dimensional spaces, where every key defines a unique cardinality vector, and properties are additional values stored along that dimension. They're just classic normalized dimensions.

    Query Resolution

    Joins and source tables are automatically resolved in Trilogy. You won't ever explicitly specify one in your query; you're declaring what you want, not how to get it. The responsibility of how to get it is delegated to the semantic model.

    WARNING

    This means that the correctness of the model is critical to getting an accurate result. The upside is that your typical user can consume the model with high confidence.

    For any concept based on a transformation, resolution happens by recursively unpacking the concept into components and checking if there is any source materialized with those components. Eventually, all base concepts will be reached, and these will then be resolved to the underlying tables. The compiler then stitches these together into a set of CTEs and joins to get the final result, and does an optimization pass to remove redundant work.

    Datasources

    A table in a warehouse is defined as a dataset with each (useful) column bound to a concept. (This is analagous to primary key, foreign key relations in a typical database design doc, but it's possible for multiple tables to be annotated as the "full" source of a given concept). The datasource will also have a "Grain", representing the PK of a particular row in the table, to ensure that the correct level of aggregation is used. [this also exists for queries, covered in more detail below.] A full concept assignment represents that all values of the concept are present in the column and is a strong link; a partial assignment is a weak link and represents a subset of values (a foreign-key relation).

    The datasource syntax is covered in more detail below, but datasources can be defined off of tables or raw SQL queries to provide flexibility.

    Joins

    Joins are created when there is not a single datasource that can resolve every concept in a query. When that happens, the compiler will select multiple datasources and combine them with a join.

    A concept can be directly queried against anything that it has a strong or weak connection to. If a concept is not directly linked to another concept, but one or more intermediate concepts can be introduced that together will bridge the concepts, then they can be queried together.

    TIP

    Joins may also be introduced by specific functions, such as filter calls or aggregation, which require building a CTE at a specific grain and then joining it back into the rest of the query results.

    Resolution in Practice

    If we take a simplified warehouse ERD with two tables with order info, one with customers, and one with products.

    A ERD view of tables in a warehouse

    Orders have direct soft links to customers and products. A query of 'select product_name, date(order_placed)' is resolved via the weak connection between orders and products via the common product_id assignment.

    WARNING

    If there are multiple paths to resolve concepts that are not directly linked, a query will fail to compile unless one is specifically hinted. See the "ambiguous query resolution" section for more details. For example, if a model has orders associated with the source warehouse and the delivery vehicle, and both warehouses and delivery vehicles can be associated with color, asking for orders by color will be ambiguous - the query needs to specify order warehouses by color or order delivery vehicles by color.

    A query of 'select productname, count(customer_id)' can _also be resolved, but will require adding in an intermediate join across the order_id universe to bridge those concepts.

    Grains and Aggregation

    Besides concepts, every datasource and query in Trilogy has a "grain". The grain represents the unique combination of vectors stored in the table/query.

    A query is automatically grouped - or aggregated - up to the grain of all keys in the output selection.

    Metrics will be calculated at this level using appropriate logic.

    TIP

    A property is a special key that is related to a parent key; for example, an order ID might have a time it was placed, a location, a source. Only the parent key is included in the grain if a property and the base key are both in the select.

    Models and Reuse

    Typical queries will import an existing semantic model to reference concepts. This is done via imports, which are similar to imports in Python or other languages. A import will be a relative file path, with directories represented by a . separator, and requires an alias to be bound to in the local namespace.

    import bigquery.github.repo as repo;
    

    Querying

    Once imported, a model can be queried using typical SQL syntax. Note that no tables need to be referenced.

    select
        repo.repo,
        repo.license,
    order by
        repo.repo asc
    limit 10;
    

    Extension By Assignment

    A model can be extended at any point via assignment operations, where a function is being used to derive a new concept.

    Functions have defined output data types, so we don't need to specify a type here. The auto type will let the compiler infer the type [property, key, or metric].

    auto license_count <- count(repo.repo) by repo.license;
    
    select
      repo.license,
      license_count
    order by
      license_count desc
    limit 10;
    

    Extension by Query

    A query can also be used to extend a model.

    This query shows a select with an assignment, where the select is creating a new concept. This new concept can be used in another query without requiring redefinition. An explicit definition is generally preferable, but this is useful for one-off queries.

    select
      repo.license,
      count(repo.repo) -> license_count
    order by
      license_count desc
    limit 10;
    

    Filtering

    Trilogy supports filtering in 3 main ways.

    By Select

    A select statement can be filtered at the output. Only concepts directly output by the select can be filtered.

    TIP

    It can be useful to use the hidden modifier (--) on a select column to include a column in a query solely for filtering without returning it in the result set.

    Basic Filter

    To filter a select, ensure the select contains the concept the value to be filtered on (even if hidden)

    
    const x <- unnest([1,2,3,4]);
    
    select
      --x,
      x*x*x -> x_cubed
    where 
      (x % 2) = 0;
    
    

    By Rowset

    A rowset - a reusable CTE - can be defined that does filtering on the output. This is the best way to filter a collection of concepts, as it can be reused across multiple queries.

    CTE Filtering

    A filtered CTE is the best way to filter multiple values by a condition in a reusable way.

    
    const x <- unnest([1,2,2,3]);
    
    with even_squares as select 
        x, 
        x*x as x_squared
    where (x_squared %2) = 0;
    
    select 
        even_squares.x_squared
    ;
    

    Filtered Concept

    A concept can be directly filtered via a filter statement to create a new concept. This is useful concise syntax for creating a reusuable subset of a larger metric, and can be particularly useful to feed into aggregates.

    These first two queries will produce identical outputs, but the filtered concept can be re-used in other queries.

    -- using a filtered concept
    auto mit_repo <- filter repo.repo where repo.license = 'mit';
    -- use the new concept we just defined
    select
      mit_repo,
      repo.license,
    limit 10;
    -- filter the concepts in a cte
    with mit_repos as select
      repo.repo,
      repo.license,
    where
      repo.license = 'mit';
    select 
      mit_repos.repo,
      mit_repos.license;
    -- filter the outputs via the where clause
    select
      repo.repo,
      repo.license,
    where
      repo.license = 'mit'
    limit 10;
    

    Concept Filter

    Filtered concepts are a concise way of reducing a single concept to a subset of values.

    
    const x <- unnest([1,2,2,3]);
    
    const even_x <- filter x where (x % 2) = 0;
    
    select 
        x, 
        even_x
    ;
    

    Concept Filter Where Clause

    The where clause can reference any set of concepts, not just the concept being filtered on.

    
    const x <- unnest([1,2,2,3]);
    
    const y <- x+1;
    
    const odd_y <- filter x where (x % 2) = 0;
    
    select 
        count(odd_y) -> odd_y_count,
        count(x) -> x_count
    ;
    

    Windows and Ranks

    Trilogy supports window functions through specific window keywords, such as rank, lag, and lead. As with filters, ranking should generally be defined explicitly as a new concept to enable reuse, then referenced in the query.

    Rank Example

    auto license_repo_count <- count(repo.repo) by repo.license;
    
    auto license_rank <- rank repo.license by license_repo_count desc;
    
    select
      repo.license,
      license_repo_count,
      license_rank
    where
      license_rank<10
    order by
      license_rank asc
    limit 10;
    

    Rowsets/CTEs

    Reusable rows can be created using a rowset/CTE definition. This is useful for creating a virtual table that can be used in multiple queries, similar to a CTE in typical SQL but in the global namespace. This can be particularly handy for window functions, to make filtering a set of concepts on rank/row_number easier.

    CTE Example

    with mit_repos as select
      repo.repo,
      repo.license,
    where repo.license='mit';
    
    select
      mit_license.repo.repo,
      mit_license.repo.license
    where mit_license.repo.repo like '%test%'
    limit 10;
    
    select
      count(mit_license.repo.repo) -> total_mit_repos,
    limit 10;
    

    Modeling

    Modeling is the process of creating a semantic mapping to your database before writing a query.

    Modeling has two core phases; first defining your concepts, and second binding them to your database through datasource definitions.

    Concepts

    Concepts are either bare declarations of the form purpose name type or a derivation of the form purpose name <- expression.

    A property will have an additional reference clause, of one of two forms

    • property key.property_name type for a property that is directly related to a key
    • property <key1,key2>.property_name type for a property that has a composite key (ex property <order_id, product_id>.order_revenue int)

    Valid purposes are

    • key: a unique identifier
    • property: a value associated with a key
    • metric: an aggregatable value
    • auto: let the compiler assign (useful for derived concepts)

    TIP

    In general, you should only need to use keys, property, and auto.

    Rowsets

    Rowsets, covered above, are reusable virtual CTEs. A rowset definition is a named select statement that can be used in multiple queries. A rowset will implicitly create a new namespace containing all concepts output in the select.

    Syntax:

    with <name> as select
      repo.repo,
      repo.license,
    where repo.license='mit';
    
    select
      <name>.<concept>
    ;
    

    Datasources

    Datasources are the interface between a model and a database. They can be created, migrated, and altered without requiring any changes to queries built on top of the model, assuming that there are still valid mappings for each required concept. Static analysis can can be used to ensure that all queries are still valid after a datasource is altered, enabling safe migrations.

    Datasource Examples

    A basic datasource defines a mapping between columns and concepts, a grain, or granularity of table and the address in the underlying database that will be queried.

    TIP

    Getting the grain right is critical. Failure to appropriately identify the primary key of a table will result in incorrect tables. If you don't know the grain, leave the grain clause empty and Trilogy/Trilogy will always handle it conservatively.

    Datasources can also be defined off queries, which enables expressing any kind of logic the underlying SQL database can support. However, datasources using custom SQL will be harder to migrate between different SQL dialects.

    datasource licenses (
      repo_name:  repo.repo,
      license: repo.license,
    )
    grain (repo.repo)
    address bigquery-public-data.github_repos.licenses;
    
    datasource languages (
      repo_name: repo.repo,
      language.name: language.language,
      language.bytes: language.per_repo_bytes
      )
      grain (repo.repo, language.language)
      query '''
    select
      repo_name,
      language
    FROM '''bigquery-public-data.github_repos.languages'''
    CROSS JOIN UNNEST(language) AS language
    ''';
    
    Concept Modifiers

    In a typical relational model, not all tables will have the full 'universe' of a given value.

    For example, an ecommerce website might have a list of customers and a list of orders. The order table has a customer field, but not every customer has ordered.

    At the datasource level, this is captured by marking the binding as partial with a ~.

    For example, if we had one table with all licenses, and one table that had repos + licenses, we could model that as:

    datasource licenses (
      license: license
    )
    grain (license)
    address bigquery-public-data.github_repos.licenses;
    
    datasource repos (
      repo_name:  repo.repo,
      license: ~license,
    )
    grain (repo.repo)
    address bigquery-public-data.github_repos.repos;
    

    Reference

    Select

    Derive

    A basic select references concepts or derives new ones. New concepts are bound in line with an expression followed by a name. The expression must be separated from the name with a -> or as keyword.

    const number <- unnest([1,2,2,3]);
    SELECT 
      number+2 as num_plus_2
    ;
    

    Auto Aggregate

    Aggregates can be calculated in line.

    const number <- unnest([1,2,2,3]);
    SELECT 
      sum(number) as number_sum
    ;
    

    Grouped Aggregate

    And will group to the grain of the rest of the grain.

    const number <- unnest([1,2,2,3]);
    SELECT
      number, 
      count(number) as number_sum
    ;
    

    Window Functions

    Window functions are similar to their SQL counterparts, but can be used in any location in a query or grouped by normally.

    row_numberrank

    Basic Rank

    A basic window function requires a concept to be doing the window over, and a partition key.

    const x <- unnest([1,2,2,3]);
    const y <- 5;
    auto z <- rank x order by x desc;
    
    select x, z;
    

    Basic Row Number

    A row_number, and other window functions, can also subdivide the concept into smaller windows using over.

    const x <- unnest([1,2,2,3]);
    const y <- 5;
    auto z <- row_number x over x order by y desc;
    
    select x, z;
    

    Rowsets

    Rowsets are named select statements, that can be used to create virtual rows for reuse. They are particularly useful alongside window functions because they grant the ability to filter across multiple conepts with a single where clause.

    Basic Rank

    const x <- unnest([1,2,2,3]);
    const y <- 5;
    auto z <- rank x order by x desc;
    
    rowset my_rowset <- select x, z where z = 1;
    
    select my_rowset.x, my_rowset.z;
    

    Aggregate Functions

    Aggregate functions are used to combine values from multiple rows into a reduced set of rows. Aggregate functions have no defined output grain and will be grouped up to the grain of the other components in the query.

    sumcountavgmaxmin

    Basic Aggregates

    
    const x <- unnest([1,2,2,3]);
    
    select 
        max(x) as max_x, 
        min(x) as min_x, 
        avg(x) as avg_x, 
        count(x) as count_x;
    

    Aggregates at Grain

    Aggregates can be grouped to a specific grain inline, without requiring a full select. A * can be used to represent the full universe of values.

    key orid int;
    property orid.store string;
    property orid.customer int;
    datasource agg_example(
      orid: orid,
      store: store,
      customer:customer,
    )
    grain(orid)
    query '''
    select 1 orid, 'store1' store, 145 customer
    union all
    select 2, 'store2', 244
    union all
    select 3, 'store2', 244
    union all
    select 4, 'store3', 244
    ''';
    
    select 
        avg(count(orid) by customer) -> avg_customer_orders,
        avg(count(orid) by store) -> avg_store_orders,
        avg(count(orid) by *) -> avg_all_orders
    ;
    

    Nesting

    Aggregate functions can be composed and nested arbitrarily.

    key orid int;
    property orid.store string;
    property orid.customer int;
    property orid.revenue float;
    datasource agg_example(
      orid: orid,
      revenue: revenue,
      store: store,
      customer:customer,
    )
    grain(orid)
    query '''
    select 1 orid, 'store1' store, 145 customer, 5.0 revenue
    union all
    select 2, 'store2', 244, 7.0 revenue
    union all
    select 3, 'store2', 244, 3.0 revenue
    union all
    select 4, 'store3', 244, 6.0 revenue
    ''';
    
    select 
        max(avg(revenue) by store) -> max_avg_store_revenue
    ;
    

    Date Functions

    Date Functions are scalar functions that manipulate dates and times.

    date_partdate_adddate_diffdate_trunc

    Basic Dates

    
      auto today <- current_datetime();
      
      select 
        date_add(today, day, 1)->tomorrow,
        date_diff(today, today, day)->zero,
        date_trunc(today, year) -> current_year 
      ;
    

    FAQ and Common Issues

    Why do I not see all rows in my table when I query a property?

    When querying a property the output set will only be the unique values. However, properties that are passed into aggregation functions such as sum, count, etc will implicitly be promoted to agg_operator(group property by property keys)

    and the model is defined as

    key order_id int;
    property order_id.order_revenue int;
    

    And the table has two columns:

    order_idorder_revenue
    1100
    2200
    3200

    Selecting order_revenue by itself will return 100, 200. Selecting count(order_revenue) will return 3, as that is implicitly evaluated as count(order_revenue by order_id). Similarly, sum(order_revenue) will return 500, as that is implicitly evaluated as sum(order_revenue by order_id).

    Trickly, select order_revenue, count(order_revenue) will return 100, 1 for each count, as that is count(group order_revenue by order_revenue).

    TIP

    It's always safest to be explicit! To avoid implicit defaults, always group the property to the desired grain: sum(group(order_rev) by order_id) will always fetch order rev + order_id before the sum.

    Querying a Property

    select 
      order_revenue, 
      count(order_revenue)->order_rev_cnt_by_order_rev,
      count(group(order_revenue) by order_id)-> order_rev_cnt_by_order_id
    ;
    

    Implicitly Identical

    select 
      sum(order_revenue) -> total_revenue,
      sum(group(order_revenue) by order_id) -> total_revenue_2
    ;
    

    Syntax

    Abbreviated syntax reference.

    Select

    select_statement: "select"i select_list where? order_by? limit?

    Multi-Select

    multi_select_statement: select_statement ("merge" select_statement)+ "align"i align_clause where? order_by? limit?

    align_item: IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER)* ","?

    align_clause: align_item ("," align_item)* ","?

    Datasource

    "datasource" IDENTIFIER "(" column_assignment_list ")" grain_clause? (address | query)

    Merge

    merge_statement: "merge" IDENTIFIER ("," IDENTIFIER)_ ","? comment_

    Import

    import_statement: "import" (IDENTIFIER ".") * IDENTIFIER "as" IDENTIFIER