Trilogy Transform
Trilogy Transform has a simple goal - minimize the need for a manually maintained list of staging tables to feed the data machine in a performant way.
Instead, focus on the outputs - the specific data products you own - and let machines handle the graph in between. We can do that efficiently and reliably by building the intermediate graph not on SQL, but on a more abstract, declarative language. This means there's no need to track intermediate state or label assets; every staging asset is a transient artifact.
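As an illustrative sketch of that shift - declare only the outputs you own and derive everything in between - the Python snippet below models concepts as declarative definitions and walks them into a plan. The `Concept` class and `plan` function are hypothetical stand-ins for the idea, not Trilogy Transform's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A declarative definition: a name plus the concepts it derives from."""
    name: str
    inputs: tuple["Concept", ...] = ()


def plan(output: Concept) -> list[str]:
    """Return the transient intermediate steps for one declared output,
    in dependency order.

    The caller only names the output it owns; every intermediate is
    discovered from the definitions rather than maintained as a
    hand-written list of staging tables.
    """
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(concept: Concept) -> None:
        if concept.name in seen:
            return
        seen.add(concept.name)
        for dep in concept.inputs:
            visit(dep)
        ordered.append(concept.name)

    visit(output)
    return ordered


# Raw concepts bound to source data; derived concepts compose them.
orders = Concept("orders")
customers = Concept("customers")
enriched = Concept("enriched_orders", (orders, customers))
ltv = Concept("customer_lifetime_value", (enriched,))

# Declaring only the output we own yields the full transient plan.
print(plan(ltv))
# -> ['orders', 'customers', 'enriched_orders', 'customer_lifetime_value']
```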
Principles
- Incrementality - you can adopt it piece by piece
- Interoperability - we should support a range of modern backends
- Simplicity without compromise - expose enough configuration to cover 80% of ETL use cases, with offramps for the rest
State
Very experimental.
History
Modern ETL - starting with Airflow, and moving into DBT, Dagster, Prefect, and more - has provided a robust toolkit for building a graph of processing that moves raw data through an enrichment pipeline, producing steadily cleaner data that eventually lands in "gold" datasets. Beyond those sits another graph of processing to support specific use cases, caching, and performance optimization.
Data mesh approaches pivot this to introduce federation, but the principles remain.
A first cut of this graph is often clean and optimized - but as a company evolves and data changes, it becomes increasingly difficult to manage.
There's precedent for managing this with virtualization - e.g., the whole graph is views - but that doesn't solve the fundamental composability problem of SQL.
Since Trilogy defines concepts individually and composes them, you can query against the graph, not tables. This naturally flattens your intermediate graph: the maximum path length is bounded by the depth of your deepest calculation. You can't have a processing graph 20 nodes deep unless you actually define a business metric with 20 layers of nested calculations.
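To make the flattening concrete, here's a small illustrative Python model (a toy, not the actual planner) that computes the nesting depth of a concept definition; that depth is the longest path the derived processing graph can contain.

```python
# Concept definitions as a mapping: name -> the concepts it is built from.
# Leaves (empty tuples) are raw concepts bound directly to source data.
concepts: dict[str, tuple[str, ...]] = {
    "revenue": (),
    "cost": (),
    "margin": ("revenue", "cost"),        # one layer of nesting
    "margin_pct": ("margin", "revenue"),  # two layers of nesting
}


def nesting_depth(name: str) -> int:
    """Depth of the deepest chain of nested calculations under `name`."""
    inputs = concepts[name]
    if not inputs:
        return 0
    return 1 + max(nesting_depth(i) for i in inputs)


# The processing graph derived for `margin_pct` is at most two layers deep:
# the longest path a planner can produce matches the calculation's own
# nesting, so a 20-node-deep pipeline requires a 20-layer metric.
assert nesting_depth("margin_pct") == 2
print(nesting_depth("margin_pct"))  # -> 2
```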