Dynamic Aggregation
What's the problem?
Raw data is collected by the business. It then needs to be presented and consumed, with low latency.
For small data, reporting and analytics can run directly off raw inputs. But as the complexity of transformations increases and the scale of the data grows, the latency of these reports climbs and climbs.
The classic solution is precomputation. You process all the data into the views that are needed, run that on a schedule, and users work off these assets.
This leads to the classic daily batch pipelines seen in many companies - every day, dump out your operational stores, load them into a data warehouse/lakehouse, run your analytics, and give your users fresh data when they show up in the morning.
For bonus points, you can do shorter schedules (hourly) or dip your toes into streaming/near-real-time architectures.
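As a minimal sketch of that precomputation pattern (table and column names here are hypothetical), a nightly job might materialize a reporting aggregate like this:

```sql
-- Hypothetical nightly job: rebuild a reporting aggregate from raw orders.
-- Run on a schedule so morning dashboards read the precomputed table
-- instead of scanning the raw data every time.
CREATE OR REPLACE TABLE reporting.daily_revenue AS
SELECT
    order_date,
    region,
    SUM(order_amount) AS total_revenue,
    COUNT(*)          AS order_count
FROM raw.orders
GROUP BY order_date, region;
```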
And This Works
From OLAP cubes through to Snowflake, this has worked mostly fine. There are some cost/complexity tradeoffs, but they're solvable - with some good design, the right team, and some classic elbow grease.
Until Things Change
But the constant bane of reporting still exists - change. When the business launches a new line of business; or they need a new model; or a new director wants new cuts; or customer retention is flatlining and they need a sliceable deep-dive right away - the level of aggregation that used to be appropriate no longer works.
Now you need to go add a new dimension into your facts, wire it up, rebuild your aggregates, update the report.
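Continuing the hypothetical aggregate above, a single new requirement - say, slicing by sales channel - means rebuilding the asset at a new grain and re-pointing everything downstream:

```sql
-- The new director wants revenue sliced by sales channel as well.
-- The precomputed aggregate doesn't carry that dimension, so it has to be
-- rebuilt at a finer grain (and every downstream report updated to match).
CREATE OR REPLACE TABLE reporting.daily_revenue AS
SELECT
    order_date,
    region,
    sales_channel,            -- newly required dimension
    SUM(order_amount) AS total_revenue,
    COUNT(*)          AS order_count
FROM raw.orders
GROUP BY order_date, region, sales_channel;
```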
This cycle can repeat 2-3 times - making it very tempting to just throw more compute at the problem. If you have a big enough BigQuery or Snowflake instance, why do you need aggregates at all? Just build it on demand.
There's a Better Path
Companies like AtScale innovated in another direction - make your aggregations usage-based. It's the best of both worlds. You build a flexible, granular, clean data model - and then, based on predicted or actual usage, you dynamically materialize aggregates on top of your raw data and route any queries they can satisfy to those aggregates.
(BigQuery BI Engine and others follow similar principles.)
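Conceptually, the routing looks something like this (plain SQL with hypothetical names, not any vendor's actual rewrite mechanism): the user queries the granular model, and the engine answers from a materialized aggregate that covers the query's grain and filters.

```sql
-- What the user writes: a query against the granular model.
SELECT region, SUM(order_amount) AS total_revenue
FROM raw.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY region;

-- What a usage-aware engine can execute instead, once it has materialized
-- an aggregate at the (order_date, region) grain:
SELECT region, SUM(total_revenue) AS total_revenue
FROM reporting.daily_revenue
WHERE order_date >= DATE '2024-01-01'
GROUP BY region;
```

The rewrite is only valid because the measure is additive across the pre-aggregated grain - which is part of why this kind of routing needs a rich semantic model behind it.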
This - for the right consumers, in the right tools - is a magical world. Their queries are fast; their dashboards load quickly; they can drill down as far as they want.
But many of these tools limit you by building the resolution logic into a specific platform or engine.
With their focus on consumption, they're also cut off from one of the largest areas where DAGs calcify - data prep.
Trilogy Provides It for Everyone
We're - ambitiously - attempting to solve this globally at the SQL layer.
For real-time access, queries written against an updated model that includes aggregate sources will dynamically resolve to those aggregates.
For batch access, you can recompile your pipelines off updated models to dynamically create intermediate aggregates, and decouple your final reporting datasets from an upstream graph that exists purely for efficiency/deduplication.
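As an illustration of the batch case - written in plain SQL rather than Trilogy syntax, with hypothetical table names - a recompiled pipeline might emit an intermediate aggregate and point multiple final reports at it:

```sql
-- The compiler inserts an intermediate aggregate that exists purely for
-- efficiency/deduplication...
CREATE OR REPLACE TABLE staging.revenue_by_day_region AS
SELECT order_date, region, SUM(order_amount) AS total_revenue
FROM raw.orders
GROUP BY order_date, region;

-- ...and the final reporting datasets are decoupled from raw.orders,
-- reading from the shared intermediate instead:
CREATE OR REPLACE TABLE reporting.monthly_revenue AS
SELECT DATE_TRUNC('month', order_date) AS month, SUM(total_revenue) AS total_revenue
FROM staging.revenue_by_day_region
GROUP BY 1;

CREATE OR REPLACE TABLE reporting.regional_revenue AS
SELECT region, SUM(total_revenue) AS total_revenue
FROM staging.revenue_by_day_region
GROUP BY region;
```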
We don't expect this to be able to go all the way back to the roots - in the medallion model, this will likely just replace silver<>gold level processing - but that's already a major simplification for many data engineering teams.
Not Easy, Though
But how do you actually detect a duplicate calculation? Total revenue in one script might not match total revenue in another script.
What Trilogy attempts to do - though this process will undergo continuous refinement - is identify when two scripts are operating in an identical metadata environment for all relevant data used. Where they are, we can safely merge any upstream CTEs to produce shared upstream assets. By repeating this process for successive layers of pre-aggregation/calculation, we can build the optimal set of pre-computed aggregates.
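A sketch of the idea (hypothetical scripts and names): two reports that each define an identical CTE over the same upstream data can have that CTE lifted into one shared, pre-computed asset.

```sql
-- Two separate report scripts each define the same CTE over the same
-- upstream data:
--   WITH completed_orders AS (
--       SELECT order_id, order_date, region, order_amount
--       FROM raw.orders
--       WHERE status = 'complete'
--   )
-- Because both scripts operate in an identical metadata environment for
-- that CTE, it can be merged into one shared upstream asset:
CREATE OR REPLACE TABLE shared.completed_orders AS
SELECT order_id, order_date, region, order_amount
FROM raw.orders
WHERE status = 'complete';

-- Each report then reads from the shared asset instead of repeating the CTE:
SELECT region, SUM(order_amount) AS total_revenue
FROM shared.completed_orders
GROUP BY region;
```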
This can be done either in a batch fashion - for ETL scripts - or dynamically throughout the day, updating the served datasources for ad-hoc exploration, though the latter is still an area of active investment in tooling.