Datasources
Datasources, datasinks, and data - all canonically called datasources in Trilogy - represent the same thing: materialized or on-demand-computed data.
They are what bind the logical computation model to physical resources, and enable you to move data in and out of the logical model.
Backends
Datasources support three backend types, each documented in their own page:
- Database — tables and queries in a connected database (e.g. DuckDB, BigQuery, Snowflake)
- File — local or remote files (Parquet, CSV, JSON)
- Python — UV-style Python scripts that emit an Arrow table
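As a sketch (the concept, table, and file names here are hypothetical), the database and file backends are declared with the same shape and differ only in the binding clause:

```
# database backend: bind physical columns to concepts
datasource orders (
    order_id: id,
    order_amount: amount
)
grain (id)
address my_schema.orders;

# file backend: same shape, bound to a local Parquet file
datasource orders_file (
    order_id: id,
    order_amount: amount
)
grain (id)
file `data/orders.parquet`;
```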
Formatted Addresses (f-strings)
Any address or file path can be written as a backtick string prefixed with f, enabling concept interpolation. This works across all backend types — database addresses, file paths, and remote URLs.
# the data_version concept drives the path at query time
datasource my_table (...)
grain (id)
file f`https://storage.googleapis.com/my-bucket/data_v{data_version}.parquet`;
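The same interpolation applies to database addresses. A sketch, assuming a `shard` concept exists in the model:

```
# the shard concept selects the physical table at query time
datasource events (...)
grain (event_id)
address f`analytics.events_{shard}`;
```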
Roots
Root datasources - labeled with the prefix root - represent data arriving into your system that is not managed by Trilogy.
Trilogy will generally not operate on these, and uses them as canonical watermarks for concepts bound to them.
Tips
For example, if you are importing external data with a primary key into your warehouse, that table might be the 'root' datasource for all computations derived from that data.
root datasource (...);
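A fuller sketch of a root declaration, with hypothetical names: an externally loaded feed that Trilogy reads and watermarks but never writes.

```
# raw vendor feed, loaded by an external process; never rebuilt by Trilogy
root datasource raw_orders (
    order_id: id,
    loaded_at: loaded_at
)
grain (id)
address landing.raw_orders;
```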
Standard
A standard datasource can be either read or written to by Trilogy, depending on your exact statements.
Datasources may be marked with one of two freshness-tracking fields:
Incremental By
Defines an incrementing key to check freshness. Typical examples include integer primary keys or dates. An incremental datasource can be appended to using this key.
datasource (...)
incremental by <field>;
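Filling in the template above with a hypothetical integer primary key: an append only needs rows whose key exceeds the stored watermark.

```
# append only rows where id exceeds the current maximum
datasource orders (...)
grain (id)
incremental by id;
```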
Freshness By
Defines a watermark column; a stale source will be rebuilt.
datasource (...)
freshness by latest_landmark_update_through;
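A fuller sketch using the watermark column named above (the datasource and grain are hypothetical): when upstream data advances past the watermark, the entire source is rebuilt.

```
# rebuild the whole source whenever it falls behind the watermark
datasource daily_summary (...)
grain (day)
freshness by latest_landmark_update_through;
```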
Tips
Use freshness when a datasource needs to be entirely rebuilt; use incrementality when you want to do optimized incremental loading.
Partial
Partial is an optional keyword that marks every bound field on the datasource as partial - this removes a source of error when managing concepts.
Partial datasources often make the most sense when many roots are combined to produce one full dataset, such as an archive table plus a recent table, or multiple sources merged into one canonical dataset (for example, tree datasets from multiple cities).
A partial source will almost always have a 'complete where' modifier, which marks the filter condition under which the source is "complete" - that is, contains the full dataset.
Tips
When a complete where clause matches a query filter, queries can be optimally resolved directly from partial sources - they are complete! Trilogy will always attempt to push down in this way when possible.
partial datasource (...)
complete where field=thing;
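Following the archive-and-recent pattern described above (names and the date cutoff are hypothetical), each source declares the range for which it is complete:

```
# complete only for the current year; an archive source covers the rest
partial datasource orders_recent (...)
grain (id)
complete where order_date >= '2024-01-01';
```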
Lifecycle
Persist
Datasources can be managed one of two ways. They can be directly modified via persist statements; this is similar to running an insert statement in SQL. Full and incremental persists are possible.
persist into <datasource> from <select>
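Filling in the template above with hypothetical names (exact select syntax may vary by model):

```
# full persist: materialize the select into the order_summary datasource
persist into order_summary
from select
    order_date,
    sum(amount) -> total_amount;
```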
Refresh
When running from the CLI, a data model can also be 'refreshed' - this will watermark all datasource incrementality fields and update anything that is stale.
This is a more asset-oriented model that minimizes computation and is recommended when possible.
trilogy refresh <path_to_folder>
Tips
Persist vs Refresh is imperative vs declarative; we recommend the declarative approach whenever it makes sense, especially for managing warehouse processing and updates. Persist can be useful for ad hoc scripts or exports.
