Space Reporting
Let's look at rocketry
Space datasets are interesting to work with because we have rich data from both government and hobbyist sources.
And a lot of people like space, so that's a bonus.
Today we'll work with data from this fantastic source:
Info
McDowell, Jonathan C., 2020. General Catalog of Artificial Space Objects, Release 1.8.0, https://planet4589.org/space/gcat
What are we going to do?
Let's make two views: one of where things go up (rockets), and one of where things go around (satellites).
Our "where things go up view" is going to look like this, for example:

Links
If you just want to skip ahead to the final product, links are here:
Tips
Load with sound on if you can!
What's interesting?
The data is hosted on the website as a bunch of CSV files. We need some preprocessing to expose them as tables, so that we can quickly iterate on and extend SQL queries to create our output datasets.
DuckDB can read CSV directly, but these files have enough warts that doing so isn't especially resilient or safe; it's a lot easier to manage the processing in Python.
How can we do this?
We'll use the DuckDB shell extension to expose Python scripts written in uv style directly to DuckDB, following this model.
Tips
uv is a fast Python package manager. uv scripts embed their dependencies directly in the file header, making them self-contained and reproducible without a separate requirements.txt.
An example script to ingest satellite data looks like this. We've delegated most of the work of processing the raw file into the Arrow format we want to a shared helper; the .tsv path passed in is the full remote path on the server.
#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.13"
# dependencies = ["pyarrow", "requests"]
# ///
from ingest_core import emit, ingest_gcat_file

if __name__ == "__main__":
    table = ingest_gcat_file("tsv/cat/satcat.tsv")
    emit(table)
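For reference, the shared helper boils down to something like the sketch below. This is a hedged reconstruction rather than the actual ingest_core module: the function names match the script above, but the base URL, the tab-delimiter handling, and the choice to emit an Arrow IPC stream on stdout are assumptions about how the pieces fit together.

# ingest_core.py (illustrative sketch; the real module may differ)
import io
import sys

import pyarrow as pa
import pyarrow.csv as pacsv
import requests

GCAT_BASE = "https://planet4589.org/space/gcat/"  # assumed base URL

def ingest_gcat_file(path: str) -> pa.Table:
    """Download a GCAT tab-separated file and parse it into an Arrow table."""
    response = requests.get(GCAT_BASE + path, timeout=60)
    response.raise_for_status()
    return pacsv.read_csv(
        io.BytesIO(response.content),
        parse_options=pacsv.ParseOptions(delimiter="\t"),
    )

def emit(table: pa.Table) -> None:
    """Write the table to stdout as an Arrow IPC stream for the caller to consume."""
    with pa.ipc.new_stream(sys.stdout.buffer, table.schema) as writer:
        writer.write_table(table)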
Once we have that script, we can define a trilogy datasource that calls it whenever the source needs to be resolved in a DuckDB query.
root datasource satcat_raw (
JCAT: jcat,
Satcat: satcat,
Launch_Tag: launch.launch_tag,
Piece: piece,
Type: type,
#...lots more fields...
data_update_date: data_updated_through
)
grain (jcat)
file `ingest_satcat.py`
freshness by data_updated_through;
One key callout there is the 'freshness by' clause. In our extraction logic we parse the 'last updated through' text on the GCAT homepage and turn it into a date that gets stamped onto the table.
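That parsing step looks roughly like the following. The regular expression and date format here are assumptions about how the homepage phrases its update notice, so treat this as a sketch of the idea rather than the exact logic.

# Sketch: derive data_updated_through from the GCAT homepage (pattern is assumed)
import re
from datetime import date, datetime

import requests

def fetch_data_updated_through() -> date:
    html = requests.get("https://planet4589.org/space/gcat/", timeout=60).text
    # Assumed wording, e.g. "updated through 2020 Dec 31"; adjust to the real page text.
    match = re.search(r"updated through\s+(\d{4} \w{3} \d{1,2})", html)
    if match is None:
        raise ValueError("Could not find the 'updated through' date on the page")
    return datetime.strptime(match.group(1), "%Y %b %d").date()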
Since this is expensive in wall-clock time, we don't want to hit the website every time. So we'll define downstream sources that materialize the same fields to a parquet file in Google Cloud Storage (GCS), which we'll use for default access.
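Under the hood, that materialization amounts to copying the resolved query output to a parquet file on GCS. Trilogy manages this step for us; the DuckDB-from-Python sketch below just shows what the downstream source resolves to, assuming HMAC credentials for GCS and a hypothetical bucket name.

# Sketch: what materializing a source to GCS parquet roughly looks like (bucket is hypothetical)
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# GCS is accessed through its S3-compatible API using HMAC keys.
con.execute("""
    CREATE SECRET gcs_creds (
        TYPE GCS,
        KEY_ID 'my-hmac-key-id',
        SECRET 'my-hmac-secret'
    );
""")
con.execute("""
    COPY (
        SELECT *  -- stands in for whatever query trilogy resolves for the source
        FROM satcat_raw
    ) TO 'gs://my-space-bucket/satcat.parquet' (FORMAT parquet);
""")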
Using this, we can hook all these sources up to a generic trilogy 'refresh' command. When we run it, we check the root sources (hitting the website) against the timestamps in our parquet files; if any are stale, we update them.
trilogy refresh data
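The staleness check itself is simple: compare the date the website reports against the date already stored in the materialized parquet. A minimal sketch, reusing the hypothetical fetch_data_updated_through helper from above and an assumed path and column name:

# Sketch: decide whether the materialized parquet is stale (path/column names assumed)
import duckdb

website_date = fetch_data_updated_through()
stored_date = duckdb.sql(
    "SELECT max(data_update_date) FROM 'satcat.parquet'"  # the materialized copy
).fetchone()[0]

if stored_date is None or stored_date < website_date:
    print("Stale: re-running the root ingest and rematerializing parquet")
else:
    print("Fresh: skipping the website hit entirely")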
Then we'll wire up a conditional output dump to JSON for when we do refresh, and use that JSON to serve our website!
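That dump step is just "read the parquet, write JSON the site can load." A minimal sketch, assuming hypothetical input and output paths and that it only runs when the refresh reported updates:

# Sketch: export refreshed data as JSON for the static site (paths are hypothetical)
import json

import duckdb

rows = duckdb.sql("SELECT * FROM 'satcat.parquet'").arrow().to_pylist()

with open("site/data/satellites.json", "w") as handle:
    json.dump(rows, handle, default=str)  # default=str serializes dates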
Our daily updates then look like this. We can schedule the refresh in GitHub Actions, and we have an auto-updating website!
Executing directory: data | Dialect: duck_db | Debug: disabled | Config: data/trilogy.toml
Starting parallel execution:
Files: 15
Dependencies: 22
Max parallelism: 4
Strategy: eager_bfs
✓ etl.preql (1.31s)
✓ sites.preql (6.64s) [1 datasource updated]
✓ engine.preql (6.70s) [1 datasource updated]
✓ platform.preql (6.78s) [1 datasource updated]
✓ organization.preql (6.96s) [1 datasource updated]
✓ stage.preql (4.17s) [1 datasource updated]
✓ vehicle.preql (7.23s) [2 datasources updated]
✓ launch.preql (13.62s) [1 datasource updated]
✓ core_local.preql (11.40s)
✓ debug.preql (11.58s)
✓ core.preql (11.75s)
✓ satellite.preql (19.20s) [1 datasource updated]
✓ satellite_data.preql (16.09s)
✓ satellite_data_local.preql (16.32s)
✓ debug_satellite.preql (16.74s)
Execution Summary:
Total Scripts: 15
Successful: 8
Noop: 7
Failed: 0
Total Duration: 69.41s
Datasources Updated: 9
Next Steps
With the parquet files built, we can actually set up dynamic exploration. We'll cover that in a followup.
