Trilogy Transform

Trilogy Transform has a simple goal - minimize the need for a manually maintained list of staging tables to feed your data products in a performant way.

Instead, focus on the outputs - the specific data products you own - and let machines handle the graph in between. We can do that efficiently and reliably by building the intermediate graph not on SQL, but on a more abstracted, declarative language. This means that there's no need to track intermediate state and label assets; every staging asset is a transient artifact.

Principles

  • Incrementality - you can adopt it in portions
  • Interoperability - we should support a range of modern backends
  • Simplicity without compromise - expose sufficient configuration to support 80% of ETL use cases, with off-ramps for the rest

State

Very experimental.

History

Modern ETL - starting with Airflow, and moving into DBT, Dagster, Prefect, and more - has provided a robust toolkit for building a graph of processing that moves raw data through an enrichment pipeline, producing steadily cleaner data that eventually lands in "gold" datasets. After those, there's usually another graph of processing to support specific use cases, caching, and performance optimization.

Data mesh approaches pivot this to introduce federation, but the core principles remain.

A first cut of these graphs is often clean and optimized - but as a company evolves and data changes, this graph becomes increasingly difficult to manage.

There's precedent for managing this with virtualization - e.g. making the whole graph views - but that doesn't solve the fundamental composability problem of SQL.

Since Trilogy defines concepts individually and composes them, you can query against the graph of concepts, not tables. This naturally flattens your intermediate graph: the maximum path length is bounded by the depth of your deepest calculation. You can't have a processing graph 20 nodes deep unless you actually define a business metric with 20 layers of nested calculations.
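
As a sketch, the week-over-week metric used in the example queries later in this doc has three layers of derivation - an aggregate, a window over that aggregate, and a ratio - so querying it can never require a processing graph more than three derivation steps deep. The declarations below are written standalone for illustration:

# layer one: an aggregate over orders
order_count <- count(order.id);

# layer two: a window over that aggregate
last_week_sales <- lag 1 order_count by order.store.state, order.date.week_of_year order by order.date.week_of_year asc;

# layer three: a ratio of the two
wow_orders <- last_week_sales / order_count;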

Backends

Native

TrilogyT's native engine takes in a graph of output Trilogy queries and generates intermediate tables based on re-used CTEs, with configurable cost profiles.

DBT

TrilogyT's DBT integration builds DBT models for you to execute with DBT. The models can be optimized before generation using the native build.

Dagster

TrilogyT's Dagster integration builds Dagster assets and dependencies to execute with Dagster. The models can be optimized before generation using the native build.

Optimization

When Trilogy generates a query, it produces a series of 'nodes' needed to answer it. These include joins, aggregations, filters, and derivations of new concepts.

These nodes can be thought of as having a grain - a unique set of keys - and an arbitrary number of linked concepts. If multiple queries share nodes with the same keys, that's a potential opportunity to precalculate the node once and benefit every query that uses it.

TIP

A simple mental model for this: identify all shared CTEs between two queries and prematerialize them. There's a bit more complexity than that, but not much!

To make this concrete, imagine a customer, order, and store table. The customer table has the state of the customer, and the store table has the state of the store.

If you work on a marketing team specialized in generating recurring local sales, you might primarily care about sales that happen in stores to customers in the same state. All your queries might need to join these three tables to apply that filter.

You might have queries you run each day that look something like this

Pull our recurring sales

SELECT
    order.store.state,
    order.date.week_of_year,
    count(order.id) -> order_count,
    lag 1 order_count by order.store.state, order.date.week_of_year order by order.date.week_of_year asc -> last_week_sales,
    last_week_sales / order_count -> wow_orders
where
    order.customer.state = order.store.state
    and order.date >= dateadd(day, -7, order.date)
;

Get new customers


property order.window_start <- dateadd(day, -10, order.date);

activated_customers <- filter customer.id where count(order.id) ? order.date between dateadd(day, -365, order.date) and order.window_start;

SELECT
    customer.id,
    count(order.id) -> order_count
where
    order.customer.state = order.store.state
    and order.date >= order.window_start
    and customer.id not in (activated_customers)
;

Each of these queries requires joining stores, orders, and customers. If you schedule them together, TrilogyT can share the common calculation between them at the highest common point, and the entire graph becomes faster and cheaper.
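
Concretely, the highest common point here is the same-state join of the three tables: both queries need orders joined to customers and stores, restricted to rows where the customer's state matches the store's state. The shared node looks something like the query below; the actual artifact generated (table, view, DBT model, or Dagster asset) depends on the backend and cost profile:

# shared intermediate: same-state orders at the order grain, carrying the fields both queries need
SELECT
    order.id,
    order.date,
    customer.id,
    order.store.state
where
    order.customer.state = order.store.state
;

Both downstream queries can then aggregate from this node - one by store state and week, one by customer - without repeating the three-way join.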