How catalog bundles are built
Where the metadata comes from, who translates it, and the gate it has to clear before it ships.
RegiStream does not generate metadata. Each register agency is the system of record for its own catalog; our job is to homogenize the shape, add an English projection where one isn't already published, and refuse to ship anything we can't verify.
What you download is a directory of CSVs in the autolabel schema v2 format. No API key, no auth, no proprietary container. Every row carries provenance back to the agency it came from.
Six agencies, six source shapes
Every agency exposes its register catalog differently. We honour the shape of each source rather than forcing one ingest path:
| Agency | Country | Source language | Ingest path |
|---|---|---|---|
| Statistics Sweden (SCB) | Sweden | Swedish | HTML walk of the microdata catalog |
| Statistics Denmark (DST) | Denmark | Danish | HTML walk of the research extranet |
| Statistics Norway (SSB) | Norway | Norwegian | KLASS classifications API + Vardok + variabellister |
| Försäkringskassan (FK) | Sweden | Swedish | Excel ingest with header-alias detection |
| Socialstyrelsen (SOS) | Sweden | Swedish | Excel + HOSP code-lists + HTML |
| Statistics Iceland (Hagstofa) | Iceland | Icelandic | Bilingual research database + Lýsigögn (ESMS) |
Where an agency already publishes English (Hagstofa, parts of KLASS), we keep what's there and only fall through to translation for what isn't bilingual upstream. Every variable carries a row-level `variable_source` tag identifying which ingest path it came from, so researchers can filter by provenance.
The pipeline
Eight stages, identical across all six agencies. Each stage reads and writes typed tables; rerun any stage without redoing earlier work.
- Parse — read raw scrape artifacts (HTML, Excel, JSON) into staging tables.
- Dedup — content-hash value sets so identical code-to-label mappings collapse to one row, referenced by many variables.
- Group — promote staging into typed tables: registers, variables, value sets.
- Normalize — apply the bundle inclusion rule (a variable needs at least a label, a definition, or a value set), drop empty registers, scrub fake fallback labels. Runs before translation so the LLM never spends GPU time on rows we discard.
- Translate — LLM translation of native-language labels, definitions, and value-set codes into English, glossary-guided.
- Retranslate-missing — idempotent fix-up; reruns only on rows whose translation is missing or a suspect source-language passthrough. Never overwrites good translations.
- Audit — coverage and correctness gate (see below). Hard-fails on unflagged passthrough, glossary conflicts, or register-name gaps.
- Export & package — emit schema v2 CSVs and ZIP them. The audit runs as pre-flight inside packaging; a failed audit means no ZIP.
The order is load-bearing: normalize-before-translate means LLM work is never wasted, and audit-as-gate means we cannot ship a bundle that fails its own coverage check.
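The dedup stage above hinges on content-hashing value sets. A minimal sketch of the idea, assuming a SHA-256 hash over a canonical serialization (the function and field names here are illustrative, not the pipeline's actual code):

```python
import hashlib
import json

def value_set_hash(mapping: dict) -> str:
    """Content-hash a code-to-label mapping so identical value sets
    collapse to one stored row referenced by many variables.
    Key order must not matter, so pairs are serialized in canonical
    (sorted, compact) JSON before hashing."""
    canonical = json.dumps(sorted(mapping.items()), ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two variables citing the same codes share one value set,
# regardless of the order the scrape produced the pairs in:
a = value_set_hash({"1": "Ledare", "2": "Tekniker"})
b = value_set_hash({"2": "Tekniker", "1": "Ledare"})
assert a == b
```

Because the hash is over content rather than identity, a rerun of the parse stage that re-scrapes the same code list maps back to the same stored row, which is what makes the stages safely rerunnable.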
Translation
The translator runs locally on a single GPU: no OpenAI, no Anthropic, no Google, no commercial MT service. The primary model is Llama 3.1 8B Instruct (with Mistral 7B as a fallback), executed via llama.cpp. The model loads into GPU memory once at startup and stays resident for the entire translation run.
Translation is value-set-as-unit: when translating the codes in a categorical variable (e.g. an occupation list), the whole set is sent to the model in one structured request so sibling labels stay internally consistent. Each request carries the variable's name and definition, the register it belongs to, up to ten previously-translated siblings for terminology continuity, and matched glossary hints for that chunk. Input and output are JSON.
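The shape of one such request can be sketched as follows; all field names are assumptions for illustration, not the pipeline's actual JSON schema:

```python
def build_translation_request(variable: dict, value_set: dict,
                              siblings: list, glossary_hints: dict) -> dict:
    """Assemble one structured request covering a whole value set, so
    sibling labels are translated together and stay internally consistent.
    Field names are illustrative."""
    return {
        "variable_name": variable["name"],
        "definition": variable.get("definition", ""),
        "register": variable["register"],
        "codes": value_set,            # full code -> native-label map, one request
        "siblings": siblings[:10],     # at most ten prior translations for continuity
        "glossary_hints": glossary_hints,  # matched terms for this chunk
    }
```

Sending the whole set in one request is what keeps, say, adjacent occupation codes from drifting between "manager" and "leader" across rows.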
Glossaries are per source language. High-confidence entries (≥ 0.95) override LLM output and the audit gate hard-fails on any stored translation that conflicts with one. Mid-confidence entries are injected as hints. For SSB the glossary is auto-bootstrapped from KLASS's bilingual data and topped with a hand-curated overrides file.
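The precedence rule reduces to a small lookup, sketched here with an assumed glossary structure (a per-term dict with `confidence` and `translation` keys; the real storage is not specified):

```python
HIGH_CONFIDENCE = 0.95

def apply_glossary(term: str, llm_output: str, glossary: dict) -> tuple:
    """High-confidence entries (>= 0.95) override the model's output
    outright; mid-confidence entries were already injected as hints,
    so the LLM answer stands. Returns the translation and its origin."""
    entry = glossary.get(term)
    if entry and entry["confidence"] >= HIGH_CONFIDENCE:
        return entry["translation"], "glossary-override"
    return llm_output, "llm"
```

A later audit pass then hard-fails any stored translation that still conflicts with a high-confidence entry, so an override missed at translation time cannot ship.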
Two non-negotiables
- Fail loud, never fall back to source. If translation fails on a row, it is marked `untranslated` and surfaced in the export. We do not emit the Swedish, Danish, or Norwegian word as if it were English. Source-language passthrough is the failure mode we most distrust because it is invisible to a non-native English reader.
- Register names follow a curated glossary first, LLM second. Where a registers glossary exists for a domain, the audit gate hard-fails if any register row is missing. This prevents the LLM from hallucinating register names or leaving the source-language name in a bilingual-named field.
The audit gate
Before a bundle is packaged, an audit pass runs and hard-fails on:
- Unflagged source-language passthrough — a row whose stored English looks identical to the native source and is not marked `untranslated`.
- Glossary conflict — a stored translation that contradicts a high-confidence glossary entry.
- Register-name gaps — when a domain has a curated registers glossary, every register row must be filled.
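The passthrough check is essentially a flagged-identity scan. A simplified sketch, assuming rows with `english`, `native`, and `status` fields (the real check may use fuzzier matching than exact equality):

```python
def audit_passthrough(rows: list) -> list:
    """Return the IDs of rows whose stored English is identical to the
    native source (ignoring case and surrounding whitespace) but which
    are not flagged untranslated. Any such row hard-fails the audit:
    silent passthrough is the failure mode the pipeline most distrusts."""
    failures = []
    for row in rows:
        same = row["english"].strip().casefold() == row["native"].strip().casefold()
        if same and row.get("status") != "untranslated":
            failures.append(row["variable_id"])
    return failures
```

Note that a flagged row with identical text is allowed through: the flag is what makes the gap visible to the reader, which is the whole point of the rule.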
Sampled QA reviews are not coverage proof. A reviewer eyeballing 200 rows out of 80,000 cannot certify the other 79,800. The audit gate is the proof; sampled reviews are sanity checks on top.
The contract this enforces, all the way through:
DB row count = CSV row count = website count.
If a row appears in the catalog at registream.org, it has at least one piece of real metadata behind it.
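A minimal pre-flight check for the count half of that contract might look like this; the function names are illustrative, and the website count is assumed to be compared the same way against whatever feeds registream.org:

```python
import csv

def check_count_contract(db_count: int, csv_lines) -> None:
    """Compare the database row count against the data rows in the
    exported CSV (header excluded). A mismatch aborts packaging, so
    a bundle whose counts diverge is never zipped."""
    csv_count = sum(1 for _ in csv.DictReader(csv_lines))
    if db_count != csv_count:
        raise RuntimeError(f"count mismatch: db={db_count} csv={csv_count}")
```

Raising rather than logging is deliberate: the audit runs as pre-flight inside packaging, so an exception here is exactly the "no ZIP" behavior described above.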
Output format
Each language ships as a directory of five CSVs, ZIPped: manifest, scope, release_sets, variables, value_labels. Two ZIPs per domain (native and English) sharing the same row IDs so a researcher can join them on variable_id and value_set_id.
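Because the two ZIPs share row IDs, joining them is a plain key join. A sketch using the stdlib `csv` module, with column names other than `variable_id` assumed for illustration:

```python
import csv

def join_bundles(native_lines, english_lines) -> list:
    """Join the native and English variables tables on variable_id.
    Both inputs are iterables of CSV lines; rows present only in the
    native table get an empty English label rather than being dropped."""
    english = {row["variable_id"]: row for row in csv.DictReader(english_lines)}
    return [
        {**row, "label_en": english.get(row["variable_id"], {}).get("label", "")}
        for row in csv.DictReader(native_lines)
    ]
```

The same pattern applies to `value_labels` keyed on `value_set_id`, or to a pandas `merge` if you prefer dataframes.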
See the schema v2 reference for the column-level contract, and the catalog for downloads.
What we don't ship
Deliberate non-features, called out so they aren't discovered the hard way:
- No microdata. RegiStream ships only metadata about variables and codes. We have no access to and do not redistribute any individual-level register data.
- No interpretation. Definitions are quoted from the publisher; we don't paraphrase them or add editorial commentary in English.
- No browsable variable pages. Per-variable browsing belongs at the source agency, which does it well in context. The bundle is a machine-readable surface for autolabel, not a portal.