Trilogy Transform
Trilogy Transform has a simple goal: minimize the need for a manually maintained list of staging tables to feed the data machine performantly.
Instead, focus on the outputs - the specific data products you own - and let machines handle the graph in between. We can do that efficiently and reliably by building the intermediate graph not on SQL, but on a more abstract, declarative language. This means there's no need to track intermediate state or label assets; every staging asset is a transient artifact.
Principles
- Incrementality - you can adopt it in portions
- Interoperability - we should support a range of modern backends
- Simplicity without compromise - expose sufficient configuration to support 80% of ETL use cases, and provide offramps for the rest
State
Very experimental.
History
Modern ETL - starting with Airflow, and moving into DBT, Dagster, Prefect, and more - has provided a robust toolkit for building a graph of processing that moves raw data through an enrichment pipeline, producing steadily more "clean" data that eventually lands in "gold" datasets. There's another graph of processing after these to support specific cases, caching, and performance optimization.
Data mesh approaches pivot this to introduce federation, but the core principles remain.
A first cut of these graphs is often clean and optimized - but as a company evolves and data changes, this graph becomes increasingly difficult to manage.
There's precedent in managing this with virtualization - e.g. making the whole graph views - but that doesn't solve the fundamental composability problem of SQL.
Since Trilogy defines concepts individually and composes them, you can query against the graph, not tables. This naturally flattens your intermediate graph so that the maximum path length is the depth of your deepest calculation. You can't have a processing graph 20 nodes deep unless you actually define a business metric that has 20 layers of nested calculations.
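As a rough sketch of what that looks like in practice - the concepts here (order.quantity, order.price, and the derived metrics) are illustrative assumptions, not part of any real model - each concept is declared once and composed, and a query references concepts rather than tables:

```
# illustrative concept definitions; order.quantity and order.price are assumed
# to exist in the model
property order.revenue <- order.quantity * order.price;
total_revenue <- sum(order.revenue);
revenue_per_order <- total_revenue / count(order.id);

# the query targets concepts, not tables; revenue_per_order is only two
# calculations deep, so the derived processing graph is at most two nodes deep
SELECT
    order.store.state,
    revenue_per_order
;
```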
Backends
Native
TrilogyT's native engine will take in a graph of output Trilogy queries and generate intermediate tables based on re-used CTEs, with configurable cost profiles.
DBT
TrilogyT's DBT engine integration builds DBT models for you to execute with DBT. The models can be optimized before generation using the native build.
Dagster
Trilogy's Dagster engine builds Dagster assets and dependencies to execute with Dagster. The models can be optimized before generation using the native build.
Optimization
When Trilogy generates a query, it will generate a series of 'nodes' needed to answer the question. These include joins, aggregations, filters, and derivations of new concepts.
These nodes can be thought of as having a grain - a unique set of keys - and an arbitrary number of linked concepts. If multiple queries share nodes with the same 'keys', that's a potential opportunity to precalculate and benefit both queries.
TIP
A simple mental model for this: identify all CTEs shared between two queries and prematerialize them. There's a bit more complexity than that, but not much!
To make this concrete, imagine a customer, order, and store table. The customer table has the state of the customer, and the store table has the state of the store.
If you work in a marketing team specialized in generating recurring local sales, you might primarily care about sales that happen in stores to customers in the same state. All of your queries might need to join these three tables to apply that filter.
You might have queries you run each day that look something like this:
Pull our recurring sales
```
SELECT
    order.store.state,
    order.date.week_of_year,
    count(order.id) -> order_count,
    lag 1 order_count by order.store.state, order.date.week_of_year order by order.date.week_of_year asc -> last_week_sales,
    last_week_sales / order_count -> wow_orders
where
    order.customer.state = order.store.state
    and order.date >= dateadd(day, -7, order.date)
;
```
Get new customers
```
property order.window_start <- dateadd(day, -10, order.date);
activated_customers <- filter customer.id where count(order.id) ? order.date between dateadd(day, -365, order.date) and window_start;

SELECT
    customer.id,
    count(order.id) -> order_count
where
    order.customer.state = order.store.state
    and order.date >= window_start
    and customer.id not in (activated_customers)
;
```
Each of these queries requires joining stores, orders, and customers. If you schedule them together, a calculated aggregate can be shared between them at the highest common point, and the entire graph becomes faster and cheaper.
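As a rough illustration of what that highest common point could be - written by hand here purely for explanation, with a hypothetical local_order name; trilogyt derives and materializes the equivalent node automatically - both queries depend on orders joined to their customers and stores and restricted to in-state customers, so that join and filter can be computed once:

```
# hand-written sketch of the shared node: orders restricted to customers in
# the same state as the store
local_order <- filter order.id where order.customer.state = order.store.state;

# both daily queries above can be phrased over local_order, so the optimizer
# can materialize that join + filter once and let each query read the result
SELECT
    order.store.state,
    order.date.week_of_year,
    count(local_order) -> local_order_count
;
```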