Overview
SELECT without the FROM, JOIN, or GROUP BY
Trilogy is a typed language for declarative data access that compiles to SQL. Users define and query against a semantic layer directly, instead of tables - but in familiar SQL syntax.
Trilogy compiles their query to SQL against the right tables, with the right joins. Use Trilogy directly or generate SQL to be used in other tools.
# just select what you want
SELECT
product_line,
revenue,
revenue_from_top_customers,
revenue_from_top_customers / revenue as revenue_from_top_customers_pct
WHERE
product_line in ('widgets', 'doodads')
Why is this useful?
Syntax
A basic Trilogy statement is one or more lines ending a semicolon. SQL attempts to follow SQL syntax as closely as possible for overlapping areas, while removing some features [such as joins and the FROM clause] that are handled by the semantic layer.
The most common statement is a select, which will start with select and then have one or more concepts or constant references to output. Each item in the select must be mapped to an output name via the arrow assignment -> operator or an 'as' (they are equivalent). This example is equivalent to "select 1 as constant_one, 1 as constant_one_2, '2' as constant_string_two" in most databases.
select
1 -> constant_one,
1 as constant_one_2,
'2'-> constant_string_two
;
TIP
The <- or -> assignment operator is used throughout Trilogy, but where there is an equivalent default in standard SQL that will also be supported, such as 'as' for select.
Concepts
Trilogy queries are built on concepts, which can be constants - such as integers or strings, shown in the first query - or abstract semantic refernces that an be bound to one or more tables. Concepts are the core semantic building block in Trilogy. Trilogy concepts are either keys/properties - values such as a name, ID, or price - or metrics, which are aggregatable to different levels. The cost of a banana is a property; the total cost of groceries is a metric.
TIP
Concepts represent unique dimensional spaces, where every key defines a unique cardinality vector, and properties are additional values stored along that dimension. They're just classic normalized dimensions.
Query Resolution
Joins and source tables are automatically resolved in Trilogy. You won't ever explicitly specify one in your query; you're declaring what you want, not how to get it. The responsibility of how to get it is delegated to the semantic model.
WARNING
This means that the correctness of the model is critical to getting an accurate result. The upside is that your typical user can consume the model with high confidence.
For any concept based on a transformation, resolution happens by recursively unpacking the concept into components and checking if there is any source materialized with those components. Eventually, all base concepts will be reached, and these will then be resolved to the underlying tables. The compiler then stitches these together into a set of CTEs and joins to get the final result, and does an optimization pass to remove redundant work.
Datasources
A table in a warehouse is defined as a dataset with each (useful) column bound to a concept. (This is analagous to primary key, foreign key relations in a typical database design doc, but it's possible for multiple tables to be annotated as the "full" source of a given concept). The datasource will also have a "Grain", representing the PK of a particular row in the table, to ensure that the correct level of aggregation is used. [this also exists for queries, covered in more detail below.] A full concept assignment represents that all values of the concept are present in the column and is a strong link; a partial assignment is a weak link and represents a subset of values (a foreign-key relation).
The datasource syntax is covered in more detail below, but datasources can be defined off of tables or raw SQL queries to provide flexibility.
Joins
Joins are created when there is not a single datasource that can resolve every concept in a query. When that happens, the compiler will select multiple datasources and combine them with a join.
A concept can be directly queried against anything that it has a strong or weak connection to. If a concept is not directly linked to another concept, but one or more intermediate concepts can be introduced that together will bridge the concepts, then they can be queried together.
TIP
Joins may also be introduced by specific functions, such as filter calls or aggregation, which require building a CTE at a specific grain and then joining it back into the rest of the query results.
Resolution in Practice
If we take a simplified warehouse ERD with two tables with order info, one with customers, and one with products.
Orders have direct soft links to customers and products. A query of 'select product_name, date(order_placed)' is resolved via the weak connection between orders and products via the common product_id assignment.
WARNING
If there are multiple paths to resolve concepts that are not directly linked, a query will fail to compile unless one is specifically hinted. See the "ambiguous query resolution" section for more details. For example, if a model has orders associated with the source warehouse and the delivery vehicle, and both warehouses and delivery vehicles can be associated with color, asking for orders by color
will be ambiguous - the query needs to specify order warehouses by color
or order delivery vehicles by color
.
A query of 'select productname, count(customer_id)' can _also be resolved, but will require adding in an intermediate join across the order_id universe to bridge those concepts.
Grains and Aggregation
Besides concepts, every datasource and query in Trilogy has a "grain". The grain represents the unique combination of vectors stored in the table/query.
A query is automatically grouped - or aggregated - up to the grain of all keys in the output selection.
Metrics will be calculated at this level using appropriate logic.
TIP
A property is a special key that is related to a parent key; for example, an order ID might have a time it was placed, a location, a source. Only the parent key is included in the grain if a property and the base key are both in the select.
Models and Reuse
Typical queries will import an existing semantic model to reference concepts. This is done via imports, which are similar to imports in Python or other languages. A import will be a relative file path, with directories represented by a . separator, and requires an alias to be bound to in the local namespace.
import bigquery.github.repo as repo;
Querying
Once imported, a model can be queried using typical SQL syntax. Note that no tables need to be referenced.
select
repo.repo,
repo.license,
order by
repo.repo asc
limit 10;
Extension By Assignment
A model can be extended at any point via assignment operations, where a function is being used to derive a new concept.
Functions have defined output data types, so we don't need to specify a type here. The auto type will let the compiler infer the type [property, key, or metric].
auto license_count <- count(repo.repo) by repo.license;
select
repo.license,
license_count
order by
license_count desc
limit 10;
Extension by Query
A query can also be used to extend a model.
This query shows a select with an assignment, where the select is creating a new concept. This new concept can be used in another query without requiring redefinition. An explicit definition is generally preferable, but this is useful for one-off queries.
select
repo.license,
count(repo.repo) -> license_count
order by
license_count desc
limit 10;
Filtering
Trilogy supports filtering in 3 main ways.
By Select
A select statement can be filtered at the output. Only concepts directly output by the select can be filtered.
TIP
It can be useful to use the hidden modifier (--) on a select column to include a column in a query solely for filtering without returning it in the result set.
Basic Filter
To filter a select, ensure the select contains the concept the value to be filtered on (even if hidden)
const x <- unnest([1,2,3,4]);
select
--x,
x*x*x -> x_cubed
where
(x % 2) = 0;
By Rowset
A rowset - a reusable CTE - can be defined that does filtering on the output. This is the best way to filter a collection of concepts, as it can be reused across multiple queries.
CTE Filtering
A filtered CTE is the best way to filter multiple values by a condition in a reusable way.
const x <- unnest([1,2,2,3]);
with even_squares as select
x,
x*x as x_squared
where (x_squared %2) = 0;
select
even_squares.x_squared
;
Filtered Concept
A concept can be directly filtered via a filter statement to create a new concept. This is useful concise syntax for creating a reusuable subset of a larger metric, and can be particularly useful to feed into aggregates.
These first two queries will produce identical outputs, but the filtered concept can be re-used in other queries.
-- using a filtered concept
auto mit_repo <- filter repo.repo where repo.license = 'mit';
-- use the new concept we just defined
select
mit_repo,
repo.license,
limit 10;
-- filter the concepts in a cte
with mit_repos as select
repo.repo,
repo.license,
where
repo.license = 'mit';
select
mit_repos.repo,
mit_repos.license;
-- filter the outputs via the where clause
select
repo.repo,
repo.license,
where
repo.license = 'mit'
limit 10;
Concept Filter
Filtered concepts are a concise way of reducing a single concept to a subset of values.
const x <- unnest([1,2,2,3]);
const even_x <- filter x where (x % 2) = 0;
select
x,
even_x
;
Concept Filter Where Clause
The where clause can reference any set of concepts, not just the concept being filtered on.
const x <- unnest([1,2,2,3]);
const y <- x+1;
const odd_y <- filter x where (x % 2) = 0;
select
count(odd_y) -> odd_y_count,
count(x) -> x_count
;
Windows and Ranks
Trilogy supports window functions through specific window keywords, such as rank, lag, and lead. As with filters, ranking should generally be defined explicitly as a new concept to enable reuse, then referenced in the query.
Rank Example
auto license_repo_count <- count(repo.repo) by repo.license;
auto license_rank <- rank repo.license by license_repo_count desc;
select
repo.license,
license_repo_count,
license_rank
where
license_rank<10
order by
license_rank asc
limit 10;
Rowsets/CTEs
Reusable rows can be created using a rowset/CTE definition. This is useful for creating a virtual table that can be used in multiple queries, similar to a CTE in typical SQL but in the global namespace. This can be particularly handy for window functions, to make filtering a set of concepts on rank/row_number easier.
CTE Example
with mit_repos as select
repo.repo,
repo.license,
where repo.license='mit';
select
mit_license.repo.repo,
mit_license.repo.license
where mit_license.repo.repo like '%test%'
limit 10;
select
count(mit_license.repo.repo) -> total_mit_repos,
limit 10;
Modeling
Modeling is the process of creating a semantic mapping to your database before writing a query.
Modeling has two core phases; first defining your concepts, and second binding them to your database through datasource definitions.
Concepts
Concepts are either bare declarations of the form purpose name type
or a derivation of the form purpose name <- expression
.
A property will have an additional reference clause, of one of two forms
property key.property_name type
for a property that is directly related to a keyproperty <key1,key2>.property_name type
for a property that has a composite key (exproperty <order_id, product_id>.order_revenue int
)
Valid purposes are
- key: a unique identifier
- property: a value associated with a key
- metric: an aggregatable value
- auto: let the compiler assign (useful for derived concepts)
TIP
In general, you should only need to use keys, property, and auto.
Rowsets
Rowsets, covered above, are reusable virtual CTEs. A rowset definition is a named select statement that can be used in multiple queries. A rowset will implicitly create a new namespace containing all concepts output in the select.
Syntax:
with <name> as select
repo.repo,
repo.license,
where repo.license='mit';
select
<name>.<concept>
;
Datasources
Datasources are the interface between a model and a database. They can be created, migrated, and altered without requiring any changes to queries built on top of the model, assuming that there are still valid mappings for each required concept. Static analysis can can be used to ensure that all queries are still valid after a datasource is altered, enabling safe migrations.
Datasource Examples
A basic datasource defines a mapping between columns and concepts, a grain, or granularity of table and the address in the underlying database that will be queried.
TIP
Getting the grain right is critical. Failure to appropriately identify the primary key of a table will result in incorrect tables. If you don't know the grain, leave the grain clause empty and Trilogy/Trilogy will always handle it conservatively.
Datasources can also be defined off queries, which enables expressing any kind of logic the underlying SQL database can support. However, datasources using custom SQL will be harder to migrate between different SQL dialects.
datasource licenses (
repo_name: repo.repo,
license: repo.license,
)
grain (repo.repo)
address bigquery-public-data.github_repos.licenses;
datasource languages (
repo_name: repo.repo,
language.name: language.language,
language.bytes: language.per_repo_bytes
)
grain (repo.repo, language.language)
query '''
select
repo_name,
language
FROM '''bigquery-public-data.github_repos.languages'''
CROSS JOIN UNNEST(language) AS language
''';
Concept Modifiers
In a typical relational model, not all tables will have the full 'universe' of a given value.
For example, an ecommerce website might have a list of customers and a list of orders. The order table has a customer field, but not every customer has ordered.
At the datasource level, this is captured by marking the binding as partial with a ~
.
For example, if we had one table with all licenses, and one table that had repos + licenses, we could model that as:
datasource licenses (
license: license
)
grain (license)
address bigquery-public-data.github_repos.licenses;
datasource repos (
repo_name: repo.repo,
license: ~license,
)
grain (repo.repo)
address bigquery-public-data.github_repos.repos;
Sometimes columns can contain nulls. SQL's handling of nulls can be complex, so you can provide this context to the engine with the ?
modifier.
This update to the model would indicate that the license field in the repos table can be null.
datasource licenses (
license: ?license
)
grain (license)
address bigquery-public-data.github_repos.licenses;
Reference
Select
Derive
A basic select references concepts or derives new ones. New concepts are bound in line with an expression followed by a name. The expression must be separated from the name with a -> or as keyword.
const number <- unnest([1,2,2,3]);
SELECT
number+2 as num_plus_2
;
Auto Aggregate
Aggregates can be calculated in line.
const number <- unnest([1,2,2,3]);
SELECT
sum(number) as number_sum
;
Grouped Aggregate
And will group to the grain of the rest of the grain.
const number <- unnest([1,2,2,3]);
SELECT
number,
count(number) as number_sum
;
Window Functions
Window functions are similar to their SQL counterparts, but can be used in any location in a query or grouped by normally.
Basic Rank
A basic window function requires a concept to be doing the window over, and a partition key.
const x <- unnest([1,2,2,3]);
const y <- 5;
auto z <- rank x order by x desc;
select x, z;
Basic Row Number
A row_number, and other window functions, can also subdivide the concept into smaller windows using over.
const x <- unnest([1,2,2,3]);
const y <- 5;
auto z <- row_number x over x order by y desc;
select x, z;
Rowsets
Rowsets are named select statements, that can be used to create virtual rows for reuse. They are particularly useful alongside window functions because they grant the ability to filter across multiple conepts with a single where clause.
Basic Rank
const x <- unnest([1,2,2,3]);
const y <- 5;
auto z <- rank x order by x desc;
rowset my_rowset <- select x, z where z = 1;
select my_rowset.x, my_rowset.z;
Aggregate Functions
Aggregate functions are used to combine values from multiple rows into a reduced set of rows. Aggregate functions have no defined output grain and will be grouped up to the grain of the other components in the query.
Basic Aggregates
const x <- unnest([1,2,2,3]);
select
max(x) as max_x,
min(x) as min_x,
avg(x) as avg_x,
count(x) as count_x;
Aggregates at Grain
Aggregates can be grouped to a specific grain inline, without requiring a full select. A * can be used to represent the full universe of values.
key orid int;
property orid.store string;
property orid.customer int;
datasource agg_example(
orid: orid,
store: store,
customer:customer,
)
grain(orid)
query '''
select 1 orid, 'store1' store, 145 customer
union all
select 2, 'store2', 244
union all
select 3, 'store2', 244
union all
select 4, 'store3', 244
''';
select
avg(count(orid) by customer) -> avg_customer_orders,
avg(count(orid) by store) -> avg_store_orders,
avg(count(orid) by *) -> avg_all_orders
;
Nesting
Aggregate functions can be composed and nested arbitrarily.
key orid int;
property orid.store string;
property orid.customer int;
property orid.revenue float;
datasource agg_example(
orid: orid,
revenue: revenue,
store: store,
customer:customer,
)
grain(orid)
query '''
select 1 orid, 'store1' store, 145 customer, 5.0 revenue
union all
select 2, 'store2', 244, 7.0 revenue
union all
select 3, 'store2', 244, 3.0 revenue
union all
select 4, 'store3', 244, 6.0 revenue
''';
select
max(avg(revenue) by store) -> max_avg_store_revenue
;
Date Functions
Date Functions are scalar functions that manipulate dates and times.
Basic Dates
auto today <- current_datetime();
select
date_add(today, day, 1)->tomorrow,
date_diff(today, today, day)->zero,
date_trunc(today, year) -> current_year
;
FAQ and Common Issues
Why do I not see all rows in my table when I query a property?
When querying a property the output set will only be the unique values. However, properties that are passed into aggregation functions such as sum, count, etc will implicitly be promoted to agg_operator(group property by property keys)
and the model is defined as
key order_id int;
property order_id.order_revenue int;
And the table has two columns:
order_id | order_revenue |
---|---|
1 | 100 |
2 | 200 |
3 | 200 |
Selecting order_revenue by itself will return 100, 200. Selecting count(order_revenue) will return 3, as that is implicitly evaluated as count(order_revenue by order_id). Similarly, sum(order_revenue) will return 500, as that is implicitly evaluated as sum(order_revenue by order_id).
Select order_revenue, count(order_revenue) will return 100, 1 for each count, as that is count(group order_revenue by order_revenue).
TIP
It's always safest to be explicit! To avoid implicit defaults, always group the property to the desired grain: sum(group(order_rev) by order_id) will always fetch order rev + order_id before the sum.
Querying a Property
select
order_revenue,
count(order_revenue)->order_rev_cnt_by_order_rev,
count(group(order_revenue) by order_id)-> order_rev_cnt_by_order_id
;
Implicitly Identical
select
sum(order_revenue) -> total_revenue,
sum(group(order_revenue) by order_id) -> total_revenue_2
;
Syntax
Abbreviated syntax reference.
Select
select_statement: "select"i select_list where? order_by? limit?
Multi-Select
multi_select_statement: select_statement ("merge" select_statement)+ "align"i align_clause where? order_by? limit?
align_item: IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER)* ","?
align_clause: align_item ("," align_item)* ","?
Datasource
"datasource" IDENTIFIER "(" column_assignment_list ")" grain_clause? (address | query)
Merge
merge_statement: "merge" IDENTIFIER ("," IDENTIFIER)_ ","? comment_
Import
import_statement: "import" (IDENTIFIER ".") * IDENTIFIER "as" IDENTIFIER