2025-10-24
Refresher: How Statistics Denmark (DST) currently stores data.
Intro to Parquet file format.
Problems that Parquet solves.
For example, BEF register:
Challenge: Takes many minutes to load one year of data (in R).
Can you see the issue?
Variables are not consistent across years.
Finding the metadata is difficult.
Some variables are numeric but actually categorical.
E.g. Stata requires its own .dta copies of the data, doubling storage needs.
Most data formats are row-based, like CSV. Newer formats tend to be column-based.
…becomes…
Only need age? Only read that line:
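A toy sketch of the row-based versus column-based idea, using plain Python structures in place of real file layouts. The records and field names are made up for illustration, and real formats such as Parquet add compression and metadata on top of this.

```python
# Toy illustration only: lists and dicts stand in for on-disk layouts.
# The records and field names below are invented for the example.
rows = [
    {"id": 1, "age": 34, "municipality": "Aarhus"},
    {"id": 2, "age": 58, "municipality": "Odense"},
    {"id": 3, "age": 21, "municipality": "Aalborg"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same fields.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-based: every record is visited, even though only one field is wanted.
ages_from_rows = [record["age"] for record in rows]

# Column-based: the wanted values already sit together in one list.
ages_from_columns = columns["age"]

print(ages_from_columns)  # prints [34, 58, 21]
```

In the column layout, a query that needs only `age` never touches `id` or `municipality`, which is the main reason column-based files can be read so much faster for analytical work.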
| File type | Size |
|---|---|
| SAS (.sas7bdat) | 1.45 GB |
| CSV (.csv) | ~90% of SAS |
| Stata (.dta) | 745 MB |
| Parquet (.parquet) | 398 MB |
Load in R with arrow package:
Loads all years in a fraction of a second, compared to ~5 min for one year without using Parquet.
DuckDB (https://duckdb.org/) is a recent, powerful SQL engine designed for analytical queries.
(But we should be pushing for R or Python use anyway.)
DST charges for storage used.
Parquet loads multiple files in seconds, compared to minutes for other formats.
DST charges per user on a project.