Test, Test, Test
Overview
TPC-DS is a common industry benchmarking tool, consisting of a set of tables, a set of evaluation queries, and tooling to populate the tables.
What It's Good For (Trilogy)
We don't have to come up with test queries!
We use TPC-DS to help track two things:
- Can we match the expressiveness of SQL or other alternatives that can cover the full range of TPC-DS?
- Are the queries we generate for TPC-DS cases sufficiently close in performance to the baseline queries in the DuckDB TPC-DS extension?
What It's Good for (Trilogy-NLP)
We use TPC-DS queries to help provide complicated examples to test the boundaries of the language.
What It's Good for (Trilogy-Transform)
High scale-factor databases (more data) generated by the DuckDB extension help make the performance benefits of pre-materializing joins/aggregates more measurable.
Expressiveness
We've worked through about 25 of the queries so far. Base cases are retrieved from the DuckDB extension and the official TPC-DS queries, and we compare outputs against the DuckDB extension outputs to verify correctness.
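That comparison step can be sketched as a small harness. `compare_results` below is a hypothetical helper, not part of the Trilogy or DuckDB tooling; it sorts both result sets so output ordering doesn't matter and compares floats with a tolerance to absorb aggregation round-off:

```python
import math

def compare_results(expected, actual, float_tol=1e-6):
    """Compare two query result sets given as lists of row tuples.

    Rows are sorted first so differences in engine output ordering
    don't register as failures; floats are compared with a small
    tolerance to absorb aggregation round-off.
    """
    if len(expected) != len(actual):
        return False
    for exp_row, act_row in zip(sorted(expected), sorted(actual)):
        if len(exp_row) != len(act_row):
            return False
        for e, a in zip(exp_row, act_row):
            if isinstance(e, float) or isinstance(a, float):
                if not math.isclose(e, a, rel_tol=float_tol, abs_tol=float_tol):
                    return False
            elif e != a:
                return False
    return True

# Illustrative rows only -- not real TPC-DS output:
baseline = [(1998, 5004001, "edu packscholar #1", 32512.17)]
trilogy = [(1998, 5004001, "edu packscholar #1", 32512.170000001)]
```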
Performance
Currently, Trilogy is almost always as fast as or faster than the raw queries in the DuckDB extension. We generally expect it to be equivalent and are working to ensure that stays the case.
Interesting Queries
Query 3 - Brand Sales
This is a fairly straightforward query in both languages.
SQL
SELECT dt.d_year,
item.i_brand_id brand_id,
item.i_brand brand,
sum(ss_ext_sales_price) sum_agg
FROM date_dim dt,
store_sales,
item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 128
AND dt.d_moy=11
GROUP BY dt.d_year,
item.i_brand,
item.i_brand_id
ORDER BY dt.d_year,
sum_agg DESC,
brand_id
LIMIT 100;
Trilogy
The Trilogy version looks very similar, but without the GROUP BY.
import store_sales as store_sales;
WHERE
store_sales.date.month_of_year=11 and store_sales.item.manufacturer_id=128
SELECT
store_sales.date.year,
store_sales.item.brand_id,
store_sales.item.brand_name,
sum(store_sales.ext_sales_price)->sum_agg
ORDER BY
store_sales.date.year asc,
sum_agg desc,
store_sales.item.brand_id asc
limit 100;
Query 1 - Customer Returns
This is a more complicated query.
SQL
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk,
sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns,
date_dim
WHERE sr_returned_date_sk = d_date_sk
AND d_year = 2000
GROUP BY sr_customer_sk,
sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1,
store,
customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;
Trilogy
Here, we'll use an inline calculation of total returns by customer/store, and then reuse that immediately to calculate the average customer returns by store.
Since these are newly derived aggregates, we'll filter them in the HAVING clause.
import store_returns as returns;
WHERE
returns.store.state = 'TN' and
returns.return_date.year = 2000
SELECT
returns.customer.text_id,
--sum(returns.return_amount)-> total_returns, # total returns per customer/store
--returns.store.id, # this query implicitly groups by text_id and store, but only returns text_id
--avg(total_returns) by returns.store.id -> avg_store_returns # average of the previous calc
HAVING
total_returns > (1.2*avg_store_returns)
ORDER BY
returns.customer.text_id asc
limit 100;
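The shape of this query — aggregate per customer/store, then keep customers whose total exceeds 1.2x the store average — can be sketched in plain Python with toy data (values invented for illustration, not TPC-DS data):

```python
from collections import defaultdict

# Toy rows: (customer_id, store_id, return_amount)
returns = [
    ("C1", "S1", 100.0), ("C1", "S1", 50.0),
    ("C2", "S1", 40.0),
    ("C3", "S2", 10.0), ("C4", "S2", 90.0),
]

# Step 1: total returns per (customer, store) -- the CTE / inline aggregate.
totals = defaultdict(float)
for cust, store, amt in returns:
    totals[(cust, store)] += amt

# Step 2: average of those customer totals per store -- avg_store_returns.
store_sums, store_counts = defaultdict(float), defaultdict(int)
for (cust, store), total in totals.items():
    store_sums[store] += total
    store_counts[store] += 1
avg_store = {s: store_sums[s] / store_counts[s] for s in store_sums}

# Step 3: the HAVING filter -- total_returns > 1.2 * avg_store_returns.
flagged = sorted(
    cust for (cust, store), total in totals.items()
    if total > 1.2 * avg_store[store]
)
# flagged -> ["C1", "C4"]
```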
Query 6
Query 6 is an interesting example; to get the average price of all items, we need to ensure that we're using the full range of items, not just those that were sold. The Trilogy query originally used a CTE followed by a second model definition, but the implementation of the HAVING clause let all of that be shifted inline. (See the blog post on the HAVING clause for more details.)
SELECT a.ca_state state,
count(*) cnt
FROM customer_address a ,
customer c ,
store_sales s ,
date_dim d ,
item i
WHERE a.ca_address_sk = c.c_current_addr_sk
AND c.c_customer_sk = s.ss_customer_sk
AND s.ss_sold_date_sk = d.d_date_sk
AND s.ss_item_sk = i.i_item_sk
AND d.d_month_seq =
(SELECT DISTINCT (d_month_seq)
FROM date_dim
WHERE d_year = 2001
AND d_moy = 1 )
AND i.i_current_price > 1.2 *
(SELECT avg(j.i_current_price)
FROM item j
WHERE j.i_category = i.i_category)
GROUP BY a.ca_state
HAVING count(*) >= 10
ORDER BY cnt NULLS FIRST,
a.ca_state NULLS FIRST
LIMIT 100;
import store_sales as store_sales;
import item as item;
MERGE store_sales.item.id into item.id; # merge models for this query to get avg price of all items, not just sold items
WHERE
store_sales.date.year=2001
and store_sales.date.month_of_year=1
and store_sales.item.current_price > 1.2 * avg(item.current_price) by item.category
SELECT
store_sales.customer.state,
count(store_sales.customer.id) as customer_count
HAVING
customer_count>=10
ORDER by
customer_count asc nulls first,
store_sales.customer.state asc nulls first
;
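Why the merge matters can be shown with toy data: averaging current price over only the items that sold gives a different cutoff than averaging over the full item table (a single category here for brevity; values invented):

```python
# Toy item table: (item_id, current_price)
items = [(1, 10.0), (2, 20.0), (3, 90.0)]
# Suppose only the cheap items actually appear in store_sales:
sold_item_ids = {1, 2}

def avg_price(rows):
    prices = [p for _, p in rows]
    return sum(prices) / len(prices)

# Average over the full item table (what the SQL subquery computes):
full_avg = avg_price(items)                                         # 40.0
# Average over sold items only (what a naive single-model query computes):
sold_avg = avg_price([r for r in items if r[0] in sold_item_ids])   # 15.0

# The 1.2x price filter keeps different items depending on the average used:
kept_full = [i for i, p in items if p > 1.2 * full_avg]   # [3]
kept_sold = [i for i, p in items if p > 1.2 * sold_avg]   # [2, 3]
```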
Query 99
Query 99 highlights where Trilogy can force a query to be more descriptive. The query 99 prompt is "For catalog sales, create a report showing the counts of orders shipped within 30 days, from 31 to 60 days, from 61 to 90 days, from 91 to 120 days and over 120 days within a given year, grouped by warehouse, call center and shipping mode."
However, the actual query just sums rows from the catalog_sales table, which has a row per order_number and item id. Is the 'count of shipped orders' really the count of order/product combinations?
SELECT w_substr ,
sm_type ,
LOWER(cc_name) cc_name_lower ,
sum(CASE
WHEN (cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1
ELSE 0
END) AS "30 days",
sum(CASE
WHEN (cs_ship_date_sk - cs_sold_date_sk > 30)
AND (cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1
ELSE 0
END) AS "31-60 days",
sum(CASE
WHEN (cs_ship_date_sk - cs_sold_date_sk > 60)
AND (cs_ship_date_sk - cs_sold_date_sk <= 90) THEN 1
ELSE 0
END) AS "61-90 days",
sum(CASE
WHEN (cs_ship_date_sk - cs_sold_date_sk > 90)
AND (cs_ship_date_sk - cs_sold_date_sk <= 120) THEN 1
ELSE 0
END) AS "91-120 days",
sum(CASE
WHEN (cs_ship_date_sk - cs_sold_date_sk > 120) THEN 1
ELSE 0
END) AS ">120 days"
FROM catalog_sales ,
(SELECT SUBSTRING(w_warehouse_name,1,20) w_substr, *
FROM warehouse) AS sq1 ,
ship_mode ,
call_center ,
date_dim
WHERE d_month_seq BETWEEN 1200 AND 1200 + 11
AND cs_ship_date_sk = d_date_sk
AND cs_warehouse_sk = w_warehouse_sk
AND cs_ship_mode_sk = sm_ship_mode_sk
AND cs_call_center_sk = cc_call_center_sk
GROUP BY w_substr ,
sm_type ,
cc_name
ORDER BY w_substr NULLS FIRST,
sm_type NULLS FIRST,
cc_name_lower NULLS FIRST
LIMIT 100;
Trilogy forces this choice to be explicit. In the example below, we'll define a composite key to count. Overall, the query is still slightly more concise due to the elimination of joins and the GROUP BY.
import catalog_sales;
auto warehouse_short_name <- substring(warehouse.name, 1, 20);
auto catalog_pk <- order_number::string || item.id::string;
WHERE ship_date.month_seq BETWEEN 1200 and 1211
SELECT
warehouse_short_name,
ship_mode.type,
lower(call_center.name) ->cc_name_lower,
COUNT(catalog_pk ?
days_to_ship <= 30
) AS less_than_30_days,
COUNT(catalog_pk ?
days_to_ship > 30 AND days_to_ship <= 60
) AS between_31_and_60_days,
COUNT(catalog_pk ?
days_to_ship > 60 AND days_to_ship <= 90
) AS between_61_and_90_days,
COUNT(catalog_pk ?
days_to_ship > 90 AND days_to_ship <= 120
) AS between_91_and_120_days,
COUNT(catalog_pk ?
days_to_ship > 120
) AS over_120_days
ORDER BY
warehouse_short_name asc nulls first,
ship_mode.type asc nulls first,
cc_name_lower asc nulls first;
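The ambiguity is easy to demonstrate with toy rows at the catalog_sales grain of (order_number, item_id): a per-row SUM(CASE ...), a count over the distinct composite key, and a count of distinct order numbers can answer the prompt differently (data invented for illustration):

```python
# Toy catalog_sales rows: (order_number, item_id, days_to_ship)
rows = [
    (1001, "A", 12), (1001, "B", 12),  # one order, two items
    (1002, "A", 45),
    (1003, "C", 95),
]

# What the SQL SUM(CASE ...) computes: one tick per row.
rows_within_30 = sum(1 for _, _, d in rows if d <= 30)

# Counting the composite key (order_number, item_id), as the Trilogy
# catalog_pk makes explicit -- identical here, since that key is the
# table's grain:
pk_within_30 = len({(o, i) for o, i, d in rows if d <= 30})

# Counting distinct order_number alone would answer "orders shipped
# within 30 days" literally -- and gives a different number:
orders_within_30 = len({o for o, _, d in rows if d <= 30})
```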