Test, Test, Test

Overview

TPC-DS is a common industry benchmarking tool, consisting of a set of tables, set of queries for evaluations, and tooling to populate the tables.

What It's Good For (Trilogy)

We don't have to come up with test queries!

We use TPC-DS to help track two things:

  • Can we match the expressiveness of SQL or other alternatives that can cover the full range of TPC-DS?
  • Are the queries we generate for TPC-DS cases sufficiently close to the baseline performance in the duckdb TPC-DS set?

What It's Good for (Trilogy-NLP)

We use TPC-DS queries to help provide complicated examples to test the boundaries of the language.

What It's Good for (Trilogy-Transform)

High scale-factor databases (more data) generated by the duck-db extension help make performance benefits of pre-materialization of joins/aggregates more measurable.

Expressiveness

We've worked through about 25 of the queries so far. Base cases are retrieved from the duckdb extensionopen in new window and the official TPC-DS queriesopen in new window. We compare outputs to the duckdb extension outputs to guarantee correctness.

Performance

Currently, Trilogy is almost always as fast or faster than the raw queries in the duckdb extension. We would generally expect it to be equivalent and are working to ensure that stays the same.

Interesting Queries

Query 6

Query 6 is an interesting example; to get average price of all items, we need to ensure that we're using the full range of items. The Trilogy query originally used a CTE followed by a second model definition, but the implementation of having clause let all that be shifted inline. (see the blog post on the having clause for more details.)

SELECT a.ca_state state,
       count(*) cnt
FROM customer_address a ,
     customer c ,
     store_sales s ,
     date_dim d ,
     item i
WHERE a.ca_address_sk = c.c_current_addr_sk
  AND c.c_customer_sk = s.ss_customer_sk
  AND s.ss_sold_date_sk = d.d_date_sk
  AND s.ss_item_sk = i.i_item_sk
  AND d.d_month_seq =
    (SELECT DISTINCT (d_month_seq)
     FROM date_dim
     WHERE d_year = 2001
       AND d_moy = 1 )
  AND i.i_current_price > 1.2 *
    (SELECT avg(j.i_current_price)
     FROM item j
     WHERE j.i_category = i.i_category)
GROUP BY a.ca_state
HAVING count(*) >= 10
ORDER BY cnt NULLS FIRST,
         a.ca_state NULLS FIRST
LIMIT 100;
import store_sales as store_sales;
import item as item;

MERGE store_sales.item.id into item.id; # merge models for this query to get avg price of all items, not just sold items

WHERE
    store_sales.date.year=2001 
    and store_sales.date.month_of_year=1 
    and store_sales.item.current_price > 1.2 * avg(item.current_price) by item.category
SELECT
    store_sales.customer.state,
    count(store_sales.customer.id) as customer_count
HAVING
    customer_count>10
ORDER by
    customer_count asc nulls first,
    store_sales.customer.state asc nulls first
;