Overview

SELECT without the FROM, JOIN, or GROUP BY

Trilogy is a typed language for declarative data access that compiles to SQL. Users define and query against a semantic layer directly, instead of tables - but in familiar SQL syntax.

Trilogy compiles their query to SQL against the right tables, with the right joins. Use Trilogy directly or generate SQL to be used in other tools.

# just select what you want
SELECT
    product_line,
    revenue,
    revenue_from_top_customers,
    revenue_from_top_customers / revenue as revenue_from_top_customers_pct
WHERE
    product_line in ('widgets', 'doodads')

Why is this useful?

For the Analyst >

For the Engineer >

For the Data Engineer >

For Everyone >

A basic Trilogy statement is one or more lines ending a semicolon. SQL attempts to follow SQL syntax as closely as possible for overlapping areas, while removing some features [such as joins and the FROM clause] that are handled by the semantic layer.

The most common statement is a select, which will start with select and then have one or more concepts or constant references to output. Each item in the select must be mapped to an output name via the arrow assignment -> operator or an 'as' (they are equivalent). This example is equivalent to "select 1 as constant_one, 1 as constant_one_2, '2' as constant_string_two" in most databases.

select
    1 -> constant_one,
    1 as constant_one_2,
    '2'-> constant_string_two
;

Tips

The <- or -> assignment operator is used throughout Trilogy, but where there is an equivalent default in standard SQL that will also be supported, such as 'as' for select.

Concepts

Trilogy queries are built on concepts, which can be constants - such as integers or strings, shown in the first query - or abstract semantic refernces that an be bound to one or more tables. Concepts are the core semantic building block in Trilogy. Trilogy concepts are either keys/properties - values such as a name, ID, or price - or metrics, which are aggregatable to different levels. The cost of a banana is a property; the total cost of groceries is a metric.

Tips

Concepts represent unique dimensional spaces, where every key defines a unique cardinality vector, and properties are additional values stored along that dimension. They're just classic normalized dimensions.

Query Resolution

Joins and source tables are automatically resolved in Trilogy. You won't ever explicitly specify one in your query; you're declaring what you want, not how to get it. The responsibility of how to get it is delegated to the semantic model.

Warning

This means that the correctness of the model is critical to getting an accurate result. The upside is that your typical user can consume the model with high confidence.

For any concept based on a transformation, resolution happens by recursively unpacking the concept into components and checking if there is any source materialized with those components. Eventually, all base concepts will be reached, and these will then be resolved to the underlying tables. The compiler then stitches these together into a set of CTEs and joins to get the final result, and does an optimization pass to remove redundant work.

Datasources

A table in a warehouse is defined as a dataset with each (useful) column bound to a concept. (This is analagous to primary key, foreign key relations in a typical database design doc, but it's possible for multiple tables to be annotated as the "full" source of a given concept). The datasource will also have a "Grain", representing the PK of a particular row in the table, to ensure that the correct level of aggregation is used. [this also exists for queries, covered in more detail below.] A full concept assignment represents that all values of the concept are present in the column and is a strong link; a partial assignment is a weak link and represents a subset of values (a foreign-key relation).

The datasource syntax is covered in more detail below, but datasources can be defined off of tables or raw SQL queries to provide flexibility.

Joins

Joins are created when there is not a single datasource that can resolve every concept in a query. When that happens, the compiler will select multiple datasources and combine them with a join.

A concept can be directly queried against anything that it has a strong or weak connection to. If a concept is not directly linked to another concept, but one or more intermediate concepts can be introduced that together will bridge the concepts, then they can be queried together.

Tips

Joins may also be introduced by specific functions, such as filter calls or aggregation, which require building a CTE at a specific grain and then joining it back into the rest of the query results.

Resolution in Practice

If we take a simplified warehouse ERD with two tables with order info, one with customers, and one with products.

Orders have direct soft links to customers and products. A query of 'select product_name, date(order_placed)' is resolved via the weak connection between orders and products via the common product_id assignment.

Warning

If there are multiple paths to resolve concepts that are not directly linked, a query will fail to compile unless one is specifically hinted. See the "ambiguous query resolution" section for more details. For example, if a model has orders associated with the source warehouse and the delivery vehicle, and both warehouses and delivery vehicles can be associated with color, asking for orders by color will be ambiguous - the query needs to specify order warehouses by color or order delivery vehicles by color.

A query of 'select productname, count(customer_id)' can _also be resolved, but will require adding in an intermediate join across the order_id universe to bridge those concepts.

Grains and Aggregation

Besides concepts, every datasource and query in Trilogy has a "grain". The grain represents the unique combination of vectors stored in the table/query.

A query is automatically grouped - or aggregated - up to the grain of all keys in the output selection.

Metrics will be calculated at this level using appropriate logic.

Tips

A property is a special key that is related to a parent key; for example, an order ID might have a time it was placed, a location, a source. Only the parent key is included in the grain if a property and the base key are both in the select.

Models and Reuse

Typical queries will import an existing semantic model to reference concepts. This is done via imports, which are similar to imports in Python or other languages. A import will be a relative file path, with directories represented by a . separator, and requires an alias to be bound to in the local namespace.

import bigquery.github.repo as repo;

Querying

Once imported, a model can be queried using typical SQL syntax. Note that no tables need to be referenced.

select
    repo.repo,
    repo.license,
order by
    repo.repo asc
limit 10;

Extension By Assignment

A model can be extended at any point via assignment operations, where a function is being used to derive a new concept.

Functions have defined output data types, so we don't need to specify a type here. The auto type will let the compiler infer the type [property, key, or metric].

auto license_count <- count(repo.repo) by repo.license;

select
  repo.license,
  license_count
order by
  license_count desc
limit 10;

Extension by Query

A query can also be used to extend a model.

This query shows a select with an assignment, where the select is creating a new concept. This new concept can be used in another query without requiring redefinition. An explicit definition is generally preferable, but this is useful for one-off queries.

select
  repo.license,
  count(repo.repo) -> license_count
order by
  license_count desc
limit 10;

Filtering

Trilogy supports filtering in 3 main ways.

By Select

A select statement can be filtered at the output. Only concepts directly output by the select can be filtered.

Tips

It can be useful to use the hidden modifier (--) on a select column to include a column in a query solely for filtering without returning it in the result set.

Basic Filter


const x <- unnest([1,2,3,4]);

select
  --x,
  x*x*x -> x_cubed
where 
  (x % 2) = 0;

By Rowset

A rowset - a reusable CTE - can be defined that does filtering on the output. This is the best way to filter a collection of concepts, as it can be reused across multiple queries.

CTE Filtering


auto x <- unnest([1,2,3,4]);

with even_squares as select 
    x, 
    x*x as x_squared
where (x_squared %2) = 0;

select 
    even_squares.x_squared
;

Filtered Concept

A concept can be directly filtered via a filter statement to create a new concept. This is useful concise syntax for creating a reusuable subset of a larger metric, and can be particularly useful to feed into aggregates.

These first two queries will produce identical outputs, but the filtered concept can be re-used in other queries.

-- using a filtered concept
auto mit_repo <- filter repo.repo where repo.license = 'mit';
-- use the new concept we just defined
select
  mit_repo,
  repo.license,
limit 10;
-- filter the concepts in a cte
with mit_repos as select
  repo.repo,
  repo.license,
where
  repo.license = 'mit';
select 
  mit_repos.repo,
  mit_repos.license;
-- filter the outputs via the where clause
select
  repo.repo,
  repo.license,
where
  repo.license = 'mit'
limit 10;

Concept Filter


const x <- unnest([1,2,2,3]);

auto even_x <- filter x where (x % 2) = 0;

select 
    x, 
    even_x
;

Concept Filter Where Clause


const x <- unnest([1,2,2,3]);

auto y <- x+1;

auto odd_y <- filter x where (x % 2) = 0;

select 
    count(odd_y) -> odd_y_count,
    count(x) -> x_count
;

Functions

Functions let you define reusable expressions.

Function


const x <- unnest([1,2,2,3]);

def cube_and_square(x) -> x*x*x + x*x+ 1;

select 
    x, 
    @cube_and_square(x) -> cubes
;

Windows and Ranks

Trilogy supports window functions through specific window keywords, such as rank, lag, and lead. As with filters, ranking should generally be defined explicitly as a new concept to enable reuse, then referenced in the query.

Rank Example

auto license_repo_count <- count(repo.repo) by repo.license;

auto license_rank <- rank repo.license by license_repo_count desc;

select
  repo.license,
  license_repo_count,
  license_rank
where
  license_rank<10
order by
  license_rank asc
limit 10;

Rowsets/CTEs

Reusable rows can be created using a rowset/CTE definition. This is useful for creating a virtual table that can be used in multiple queries, similar to a CTE in typical SQL but in the global namespace. This can be particularly handy for window functions, to make filtering a set of concepts on rank/row_number easier.

CTE Example

with mit_repos as select
  repo.repo,
  repo.license,
where repo.license='mit';

select
  mit_license.repo.repo,
  mit_license.repo.license
where mit_license.repo.repo like '%test%'
limit 10;

select
  count(mit_license.repo.repo) -> total_mit_repos,
limit 10;

Modeling

Modeling is the process of creating a semantic mapping to your database before writing a query.

Modeling has two core phases; first defining your concepts, and second binding them to your database through datasource definitions.

Concepts

Concepts are either bare declarations of the form purpose name type or a derivation of the form purpose name <- expression.

A property will have an additional reference clause, of one of two forms

property key.property_name type for a property that is directly related to a key
property <key1,key2>.property_name type for a property that has a composite key (ex property <order_id, product_id>.order_revenue int)

Valid purposes are

key: a unique identifier
property: a value associated with a key
metric: an aggregatable value
auto: let the compiler assign (useful for derived concepts)

Tips

In general, you should only need to use keys, property, and auto.

Implicit Concepts

Some concepts are auto-generated; example date fields will have year, month, etc.

Rowsets

Rowsets, covered above, are reusable virtual CTEs. A rowset definition is a named select statement that can be used in multiple queries. A rowset will implicitly create a new namespace containing all concepts output in the select.

Syntax:

with <name> as select
  repo.repo,
  repo.license,
where repo.license='mit';

select
  <name>.<concept>
;

Datasources

Datasources are the interface between a model and a database. They can be created, migrated, and altered without requiring any changes to queries built on top of the model, assuming that there are still valid mappings for each required concept. Static analysis can can be used to ensure that all queries are still valid after a datasource is altered, enabling safe migrations.

Datasource Examples

A basic datasource defines a mapping between columns and concepts, a grain, or granularity of table and the address in the underlying database that will be queried.

Tips

Getting the grain right is critical. Failure to appropriately identify the primary key of a table will result in incorrect tables. If you don't know the grain, leave the grain clause empty and Trilogy/Trilogy will always handle it conservatively.

Datasources can also be defined off queries, which enables expressing any kind of logic the underlying SQL database can support. However, datasources using custom SQL will be harder to migrate between different SQL dialects.

datasource licenses (
  repo_name:  repo.repo,
  license: repo.license,
)
grain (repo.repo)
address bigquery-public-data.github_repos.licenses;

datasource languages (
  repo_name: repo.repo,
  language.name: language.language,
  language.bytes: language.per_repo_bytes
  )
  grain (repo.repo, language.language)
  query '''
select
  repo_name,
  language
FROM '''bigquery-public-data.github_repos.languages'''
CROSS JOIN UNNEST(language) AS language
''';

Concept Modifiers

In a typical relational model, not all tables will have the full 'universe' of a given value.

For example, an ecommerce website might have a list of customers and a list of orders. The order table has a customer field, but not every customer has ordered.

At the datasource level, this is captured by marking the binding as partial with a ~.

For example, if we had one table with all licenses, and one table that had repos + licenses, we could model that as:

datasource licenses (
  license: license
)
grain (license)
address bigquery-public-data.github_repos.licenses;

datasource repos (
  repo_name:  repo.repo,
  license: ~license,
)
grain (repo.repo)
address bigquery-public-data.github_repos.repos;

Sometimes columns can contain nulls. SQL's handling of nulls can be complex, so you can provide this context to the engine with the ? modifier.

This update to the model would indicate that the license field in the repos table can be null.

datasource licenses (
  license: ?license
)
grain (license)
address bigquery-public-data.github_repos.licenses;

Reference

Select

Derive

const number <- unnest([1,2,2,3]);
SELECT 
  number+2 as num_plus_2
;

Auto Aggregate

const number <- unnest([1,2,2,3]);
SELECT 
  sum(number) as number_sum
;

Grouped Aggregate

key idx int;
property idx.idx_val int;
datasource numbers(
    idx: idx,
    x: idx_val
)
grain (idx)
query '''
select 1 idx, 1 x
union all
select 2, 2
union all
select 3, 2
union all
select 4, 3
''';

SELECT
  idx_val,
  count(idx) as number_count
order by
    idx_val asc;

Window Functions

Window functions are similar to their SQL counterparts, but can be used in any location in a query or grouped by normally.

row_numberrank

Basic Rank

const x <- unnest([1,2,2,3]);
const y <- 5;
auto z <- rank x order by x desc;

select x, z;

Basic Row Number

const x <- unnest([1,2,2,3]);
const y <- 5;
auto z <- row_number x over x order by y desc;

select x, z;

Rowsets

Rowsets are named select statements, that can be used to create virtual rows for reuse. They are particularly useful alongside window functions because they grant the ability to filter across multiple conepts with a single where clause.

Basic Rank

const x <- unnest([1,2,2,3]);
const y <- 5;
auto z <- rank x order by x desc;

rowset my_rowset <- select x, z where z = 1;

select my_rowset.x, my_rowset.z;

Aggregate Functions

Aggregate functions are used to combine values from multiple rows into a reduced set of rows. Aggregate functions have no defined output grain and will be grouped up to the grain of the other components in the query.

sumcountavgmaxmin

Basic Aggregates


const x <- unnest([1,2,2,3]);

select 
    max(x) as max_x, 
    min(x) as min_x, 
    avg(x) as avg_x, 
    count(x) as count_x;

Aggregates at Grain

key orid int;
property orid.store string;
property orid.customer int;
datasource agg_example(
  orid: orid,
  store: store,
  customer:customer,
)
grain(orid)
query '''
select 1 orid, 'store1' store, 145 customer
union all
select 2, 'store2', 244
union all
select 3, 'store2', 244
union all
select 4, 'store3', 244
''';

select 
    avg(count(orid) by customer) -> avg_customer_orders,
    avg(count(orid) by store) -> avg_store_orders,
    avg(count(orid) by *) -> avg_all_orders
;

Nesting

key orid int;
property orid.store string;
property orid.customer int;
property orid.revenue float;
datasource agg_example(
  orid: orid,
  revenue: revenue,
  store: store,
  customer:customer,
)
grain(orid)
query '''
select 1 orid, 'store1' store, 145 customer, 5.0 revenue
union all
select 2, 'store2', 244, 7.0 revenue
union all
select 3, 'store2', 244, 3.0 revenue
union all
select 4, 'store3', 244, 6.0 revenue
''';

select 
    max(avg(revenue) by store) -> max_avg_store_revenue
;

Date Functions

Date functions are scalar functions that manipulate dates and times.

date_partdate_adddate_diffdate_trunc

Basic Dates


  auto today <- current_datetime();
  
  select 
    date_add(today, day, 1)->tomorrow,
    date_diff(today, today, day)->zero,
    date_trunc(today, year) -> current_year 
  ;

FAQ and Common Issues

Why do I not see all rows in my table when I query a property?

When querying a property the output set will only be the unique values. However, properties that are passed into aggregation functions such as sum, count, etc will implicitly be promoted to agg_operator(group property by property keys)

key order_id int;
property order_id.order_revenue int;

And the table has two columns:

order_id	order_revenue
1	100
2	200
3	200

Selecting order_revenue by itself will return 100, 200. Selecting count(order_revenue) will return 3, as that is implicitly evaluated as count(order_revenue by order_id). Similarly, sum(order_revenue) will return 500, as that is implicitly evaluated as sum(order_revenue by order_id).

Select order_revenue, count(order_revenue) will return 100, 1 for each count, as that is count(group order_revenue by order_revenue).

Tips

It's always safest to be explicit! To avoid implicit defaults, always group the property to the desired grain: sum(group(order_rev) by order_id) will always fetch order rev + order_id before the sum.

Querying a Property

select 
  order_revenue, 
  count(order_revenue)->order_rev_cnt_by_order_rev,
  count(group(order_revenue) by order_id)-> order_rev_cnt_by_order_id
;

Implicitly Identical

select 
  sum(order_revenue) -> total_revenue,
  sum(group(order_revenue) by order_id) -> total_revenue_2
;

Syntax

Abbreviated syntax reference.

Select

select_statement: "select"i select_list where? order_by? limit?

Multi-Select

multi_select_statement: select_statement ("merge" select_statement)+ "align"i align_clause where? order_by? limit?

align_item: IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER)\* ","?

align_clause: align_item ("," align_item)\* ","?

Datasource

"datasource" IDENTIFIER "(" column_assignment_list ")" grain_clause? (address | query)

Merge

merge_statement: "merge" IDENTIFIER ("," IDENTIFIER)_ ","? comment_

Import

import_statement: "import" (IDENTIFIER ".") \* IDENTIFIER "as" IDENTIFIER

Function

function_derivation: "def" "(" (IDENTIFIER "," )* ")"  -> EXPR;