To Have or Have Not

TLDR

Just like SQL< Trilogy has two filtering options; where and having. However, they serve a different purpose - the where clause is applied before any aggregations and can include any field, including those not selected on; the having clause is applied to an output projection and can only include fields in the output. But it wasn't always this way - this blog is about how we got there.

The Having Clause

Trilogy originally dropped the having clause entirely. This was in the service of simplicity; since a query would have a where clause that operated on the output projection, all values would be available for filtering and the where/having distinction was unneeded.

However, performance and authoring simplicity started to challenge that assumption. We've brought back the having clause, with a new purpose.

Implicit Where

The restrict to filtering on the output projection only made some authoring statements quite tedious. Since there was no implicit 'table' in the from clause, it was no longer possible to straightforwardly filter on a value that you didn't need to output.

While this could be worked around with hidden fields, it didn't feel great to write - it's tedious and requires you to write a field twice for no reason.

SELECT
    name,
    --age,
WHERE
    age>30
;

Implicit where projection was our attempt to solve for this. For the above query, we could infer that age was referenced as a filter, so that we need a first projection with name + age to apply the filter to. We could then optionally group/restrict our values down to return the original projection set.

Now we can write:

SELECT
    name,
WHERE
    age>30
;

That's better! We no longer have a redundant declaration.

The Aggregate Issue

However, aggregates present a problem. Imagine the following query:


SELECT
    age,
    count(person.id) ->person_count
WHERE
    state in ('vt', 'ma')
;

Does the user mean?

SELECT
    age,
    --state,
    count(person.id) ->person_count
WHERE
    state in ('vt', 'ma')
;

Likely not. They'll have duplicate rows for each state. To get the reduced set, the user would need to first define a CTE.

WITH restricted as SELECT
    age,
    state,
    person.id
WHERE
    state  in ('vt', 'ma');

SELECT
    restricted.age,
    count(restricted.id) ->person_count ;

This starts to defeat the purpose of the implicit where. If every aggregate forces you to define a CTE first, authoring feels slow and disjointed and aggregates are once more "special". You have to rewrite your entire query for a relatively simple conceptual change.

Can we fix it?

Select Context

Aggregates already respond to the query context they exist in, such as being assigned a grain of the other dimensions. We can extend this to include the conditions as well.

If what the user really means is this:

SELECT
    age ? state in ('vt', 'ma'),
    count(person.id ? state in ('vt', 'ma')) ->person_count

Then we can evaluate that automatically from a global where clause.

So now we get the right output for a condition that is not referenced in the select:

select
    age,
    count(person.id) -> person_count,
where 
    state in ('vt', 'ma')
;

which could preferably be written (and should be in Trilogy) with the where first:

WHERE
    state in ('vt', 'ma')
SELECT
    age,
    count(person.id) -> person_count,
;

But what if the query is instead this?

WHERE
    state in ('vt', 'ma')
    and
    person_count>10
SELECT
    age,
    count(person.id) -> person_count,

We can't push the condition clause into the aggregate definition because it's a circular dependency - the pushdown of the person_count would depend on the person_count itself!

Back to Having

To resolve this, we've reintroduced the having clause. It is restricted to filtering on the output of the select, and will not propagate context implicity up.

To promote a more natural reading order that matches operations, this is also why it's preferable for the where clause to come before the select.

Our final query might look like this:

-- the conceptual universe we're operating in
WHERE
    state in ('vt', 'ma')
-- define the projection
SELECT
    age,
    count(person.id) -> person_count,
-- and filter the results
HAVING
    person_count>10

Now we have a natural order - the where clause filters the space of records, the select clause defines the projection, and the having clause filters the output.

Future Steps

It's currently not strictly enforced on what conditions can be placed in each clause, as scalar values derived in the select can be safely pushed up if they don't have circular dependencies.

We may want to tighten this in the future, but for now we will error if an invalid condition is placed in the where clause.