Overview
TCP-DS is a common industry benchmarking tool, consisting of a set of tables, set of queries for evaluations, and tooling to populate the tables.
What It's Good For
We use TCP-DS to help track two things:
- Can we match the expressiveness of SQL or other alternatives that can cover the full range of TCP-DS?
- Are the queries we generate for tcp-ds cases sufficiently close to the baseline performance in the duckdb tcp-ds set?
Expressiveness
We've worked through about 16 of the queries so far. Base cases are retrieved from the duckdb extension and the official TCP-DS queries. We compare outputs to the duckdb extension outputs to guarantee expressiveness.
Performance
Currently, Trilogy is generally slower than the base duckdb queries by 10-20%, and is occasionally faster. The differences often come down to the ability to easily push down filters as early in the process as possible, as well as the ordering of certain operations.
Interesting Queries
Query 6
Query 6 is an interesting example; to get average price of all items, we need to ensure that we're using the full range of items. The Trilogy query originally used a CTE followed by a second model definition, but the implementation of having clause let all that be shifted inline. (see the blog post on the having clause for more details.)
SELECT a.ca_state state,
count(*) cnt
FROM customer_address a ,
customer c ,
store_sales s ,
date_dim d ,
item i
WHERE a.ca_address_sk = c.c_current_addr_sk
AND c.c_customer_sk = s.ss_customer_sk
AND s.ss_sold_date_sk = d.d_date_sk
AND s.ss_item_sk = i.i_item_sk
AND d.d_month_seq =
(SELECT DISTINCT (d_month_seq)
FROM date_dim
WHERE d_year = 2001
AND d_moy = 1 )
AND i.i_current_price > 1.2 *
(SELECT avg(j.i_current_price)
FROM item j
WHERE j.i_category = i.i_category)
GROUP BY a.ca_state
HAVING count(*) >= 10
ORDER BY cnt NULLS FIRST,
a.ca_state NULLS FIRST
LIMIT 100;
import store_sales as store_sales;
import item as item;
merge store_sales.item.id into item.id; # merge models for this query to get avg price of all items, not just sold items
SELECT
store_sales.customer.state,
count(store_sales.customer.id) as customer_count
WHERE
store_sales.date.year=2001
and store_sales.date.month_of_year=1
and store_sales.item.current_price > 1.2 * avg(item.current_price) by item.category
HAVING
customer_count>10
ORDER by
customer_count asc nulls first,
store_sales.customer.state asc nulls first
;