Where in the warehouse is Carmen Sandiego's data?
When starting out on Trilogy, we imagined most of the work would be identifying common concepts, mapping them, and generating queries off of them.
On toy tests, this was true. On queries closer to real workloads, though, things started to break down. Shout out to TPC-* for being fake, but fake in interesting ways.
On actual data engineering projects, the problem completely changed shape.
The real complexity in the warehouse is completeness. A query is only accurate if the conceptual set it references is actually complete.
Completeness breaks down in three ways:
- Source Completeness
- Latency
- Unions
Source Completeness
This is the one we tackled earliest.
Imagine customers can register, and then order. You want to know how many customers you have, and how many orders they have. The customer table has an ID; the order table has a FK pointing at it.
If you query the order table only, you will not get customers with no orders.
So the order table contains only a subset of the total set of customers: it is partial (marked with `~` in Trilogy syntax).
This kind of completeness is relatively easy to work around: start from the complete set, then pull in enrichment.
So to get orders plus customers, query the customer table and left join the orders onto it.
But what if you want the count of customers who ordered in August? As soon as you add a filter referencing an order concept, the order table actually contains the complete set for that query, and you do not need to inject the join to customers at all!
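A toy Python sketch of that resolution logic (the data and function names are illustrative, not Trilogy's actual implementation):

```python
# Source completeness: the order table holds only a partial set of
# customers, so an unfiltered customer count must come from the complete
# customer table, while a count filtered on an order concept can be
# answered from the order table alone.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]  # complete set of customers
orders = [
    {"customer_id": 1, "month": "August"},
    {"customer_id": 1, "month": "September"},
    {"customer_id": 2, "month": "August"},
]  # partial: customer 3 never ordered

def count_customers():
    # No order concept referenced: must resolve from the complete source.
    return len({c["id"] for c in customers})

def count_customers_who_ordered_in(month):
    # The filter references an order concept, so the partial order table
    # is complete *for this query* -- no join back to customers needed.
    return len({o["customer_id"] for o in orders if o["month"] == month})

print(count_customers())                         # 3
print(count_customers_who_ordered_in("August"))  # 2
```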
Latency
Most warehouses are in practice eventually consistent caches on top of some ingest. You get root data that comes in on a schedule or a stream; you then process it in batches or micro-batches to produce output; people consume output.
Every step adds end-to-end latency, so assets closer to the start of the pipeline are often more complete than those downstream.
To make this manageable, Trilogy models watermarks, freshness fields, and probes. These all represent the same general concept: can we determine whether an asset is stale relative to the freshest thing in the warehouse?
- Watermarks: if an asset is downstream of a source, and the source has been updated more recently than the downstream asset, the downstream asset is stale.
- Freshness fields: probe the warehouse for the max of a value. Anything below that threshold is stale.
- Probes: custom logic to determine whether an asset is stale.
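The watermark case above can be sketched in a few lines of Python, assuming each asset records the timestamp of its last refresh (the function and variable names are hypothetical):

```python
from datetime import datetime

def is_stale(asset_updated_at, upstream_updated_ats):
    # Watermark check: a downstream asset is stale if any upstream
    # source has been refreshed more recently than the asset itself.
    return any(src > asset_updated_at for src in upstream_updated_ats)

source_ts = datetime(2024, 8, 2, 6, 0)    # source refreshed this morning
derived_ts = datetime(2024, 8, 1, 12, 0)  # derived asset built yesterday

print(is_stale(derived_ts, [source_ts]))  # True: the source moved past it
```

Freshness fields and probes plug into the same question with different inputs: a max-of-column probe against the warehouse, or arbitrary custom logic, each returning the same stale-or-not answer.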
Latency modeling was not a core target at the beginning, and we are still simplifying it.
Unions
This is the one that snuck up on us. Say you are ingesting data from two CRMs. You want to normalize the data and merge it.
You stage the data from both. You can query individually. How do you query the union? Well, you map them both to the same concepts, and mark them partial.
But now neither has the complete set of data. How do we actually answer the question of 'the complete set of outreaches'?
We need some way to capture that each one is a subset, but a well-defined subset.
We handle this with 'complete for' clauses on datasources. These define the conditions for which the source is a complete set.
If we can assemble a MECE set of datasources in the warehouse, we can union them to get our full set. And if we have a condition matching a subset, we can resolve from that directly.
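A rough sketch of how that resolution might work, with each datasource declaring the predicate under which it is complete (the planner function and datasource names here are hypothetical, not Trilogy internals):

```python
# Each datasource declares a 'complete for' condition: the predicate
# under which it holds the complete set.
datasources = {
    "crm_a_outreaches": {"complete_for": {"source": "CRM_A"}},
    "crm_b_outreaches": {"complete_for": {"source": "CRM_B"}},
}

def resolve(query_filter=None):
    # If the query's filter falls inside one source's complete-for
    # condition, that source alone is complete: resolve from it directly.
    if query_filter:
        for name, ds in datasources.items():
            if all(query_filter.get(k) == v
                   for k, v in ds["complete_for"].items()):
                return [name]
    # Otherwise, union the MECE set of sources to assemble the full set.
    return sorted(datasources)

print(resolve({"source": "CRM_A"}))  # ['crm_a_outreaches']
print(resolve())                     # ['crm_a_outreaches', 'crm_b_outreaches']
```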
See It Resolve
At the end of the day, this means the same semantic query can resolve to different physical assets depending on what makes the set complete.
The three cases are the important ones:
- A query where `source = 'CRM_A'` resolves directly from the `CRM_A` asset.
- A query where `source = 'CRM_B'` resolves directly from the `CRM_B` asset.
- A query for the full CRM total resolves by unioning both base assets into a complete set.
This sounds simple written out, but it is one of the places where warehouse semantics stop being about joins and start being about correctness.