Autolabel schema v2 — RegiStream

What the schema is

An autolabel-compatible bundle contains 5 CSV files per language for each catalog domain (e.g. scb, dst, ssb):

File	Role
`{domain}_manifest_{lang}.csv`	Manifest (config): schema version, scope hierarchy names, publisher info
`{domain}_scope_{lang}.csv`	Atomic manifest of `(scope, release)` instances
`{domain}_variables_{lang}.csv`	Central facts table — one row per stable variable-metadata era, two integer FKs
`{domain}_value_labels_{lang}.csv`	Content-hashed value sets, both JSON and Stata-native format columns
`{domain}_release_sets_{lang}.csv`	Junction table linking release sets to scope atoms

All data files are pure rectangular CSVs with scalar columns. value_labels.csv carries two equivalent views of each value set — a JSON column and a pre-formatted Stata string — so consumers pick whichever fits their tool. All cross-file references are integer foreign keys. No inline text duplication. No pipe-delimited multi-value cells.

CSV conventions

Delimiter: semicolon (;) — avoids quoting issues with JSON content and Scandinavian text
Quoting: double-quote (") with inner quotes doubled ("")
Encoding: UTF-8
No pipe-separated multi-value cells in data files: all cells are scalar. The only pipe-separated list in a bundle is the manifest's languages value.
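
The conventions above map directly onto the options of a standard CSV reader. The following is a minimal sketch using Python's standard library; the helper name `read_bundle_csv` and the two-line sample are illustrative assumptions, not part of autolabel.

```python
import csv
import io

# Hypothetical sample: one header line and one row whose JSON cell
# uses doubled inner quotes, as the conventions above require.
SAMPLE = 'value_label_id;value_labels_json\n1;"{""1"":""Man"",""2"":""Kvinna""}"\n'


def read_bundle_csv(text):
    """Parse semicolon-delimited text with double-quote quoting."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=";",     # semicolon delimiter
        quotechar='"',     # double-quote quoting
        doublequote=True,  # inner quotes are doubled ("")
    )
    return list(reader)


rows = read_bundle_csv(SAMPLE)
print(rows[0]["value_labels_json"])
```

When reading from disk, opening the file with `open(path, encoding="utf-8", newline="")` matches the UTF-8 encoding rule.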

Two hierarchical axes

Every variable's identity is defined by two axes:

scope — what slice of reality the variable describes. Stored as separate columns scope_level_1, scope_level_2, etc. (e.g. scope_level_1 = LISA, scope_level_2 = Individer_16_och_äldre). Depth declared per domain in the manifest.
release — which published release instance. Opaque string (e.g. 2005, Höstterminen_2020, 2019_lone_formansutbetalningar).

Scope depth is dynamic — declared per domain in the manifest. SCB uses 2 levels ("Register" / "Variant"); SSB uses 2 differently-named levels ("Source" / "Group"); a future agency could use 4. autolabel reads the manifest at runtime to drive UI labels and browse hierarchy — no code changes needed.

Core vs. augmentation files

The five files split into two tiers:

Core (required): variables.csv + value_labels.csv. A consumer can label every variable from these two files alone, using a majority-label rule.
Augmentation (recommended): scope.csv + release_sets.csv + manifest.csv. The precision layer — enables scope/release filtering and browse hierarchy.

A conformant consumer MUST work with just the core files (graceful degradation). A conformant producer SHOULD emit all five for full precision.
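
A consumer could decide which tier it is operating in by checking which files are present. This is a sketch under assumptions: the function name `bundle_tier` is hypothetical, and treating any missing augmentation file as a fall-back to core-only behavior is an interpretation of the graceful-degradation rule, not something the schema spells out.

```python
CORE_FILES = ("variables", "value_labels")
AUGMENTATION_FILES = ("scope", "release_sets", "manifest")


def bundle_tier(filenames, domain, lang):
    """Return "full" or "core" depending on which bundle files are present."""
    def present(kind):
        return f"{domain}_{kind}_{lang}.csv" in filenames

    missing_core = [kind for kind in CORE_FILES if not present(kind)]
    if missing_core:
        # Without both core files no labeling is possible at all.
        raise ValueError(f"missing core files: {missing_core}")
    # Assumption: any missing augmentation file means core-only labeling.
    if all(present(kind) for kind in AUGMENTATION_FILES):
        return "full"
    return "core"


core_only = {"scb_variables_eng.csv", "scb_value_labels_eng.csv"}
print(bundle_tier(core_only, "scb", "eng"))  # core
```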

1. Manifest — `{domain}_manifest_{lang}.csv`

Flat key-value CSV declaring domain metadata and scope hierarchy semantics. One file per language (titles are localized).

Columns

Column	Type	Required	Description
`key`	string	yes	Configuration key
`value`	string	yes	Value for the key

Required keys

domain — catalog domain identifier (e.g. scb)
schema_version — wire-format version (currently 2.0)
publisher — publishing organization name
bundle_release_date — ISO 8601 date
languages — pipe-separated list (e.g. swe|eng)
scope_depth — integer N, the hierarchy depth for this domain
scope_level_1_name … scope_level_N_name — short machine-readable level name
scope_level_1_title … scope_level_N_title — human-readable level title in this file's language

Example — SCB English manifest

key;value
domain;scb
schema_version;2.0
publisher;Statistics Sweden (SCB)
bundle_release_date;2026-04-16
languages;swe|eng
scope_depth;2
scope_level_1_name;register
scope_level_1_title;Register
scope_level_2_name;variant
scope_level_2_title;Variant

Example — SSB English manifest (different naming)

key;value
domain;ssb
schema_version;2.0
publisher;Statistics Norway (SSB)
bundle_release_date;2026-04-20
languages;nob|eng
scope_depth;2
scope_level_1_name;source
scope_level_1_title;Source
scope_level_2_name;group
scope_level_2_title;Group
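
Reading a manifest amounts to loading the key-value rows into a dictionary and using scope_depth to find the level keys. A minimal sketch, assuming the SCB example above as input; `load_manifest` and `scope_levels` are hypothetical helper names.

```python
import csv
import io

MANIFEST = """key;value
domain;scb
schema_version;2.0
publisher;Statistics Sweden (SCB)
bundle_release_date;2026-04-16
languages;swe|eng
scope_depth;2
scope_level_1_name;register
scope_level_1_title;Register
scope_level_2_name;variant
scope_level_2_title;Variant
"""


def load_manifest(text):
    """Load the flat key;value manifest into a dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["key"]: row["value"] for row in reader}


def scope_levels(manifest):
    """Return (name, title) pairs for levels 1..scope_depth."""
    depth = int(manifest["scope_depth"])
    return [
        (manifest[f"scope_level_{i}_name"], manifest[f"scope_level_{i}_title"])
        for i in range(1, depth + 1)
    ]


manifest = load_manifest(MANIFEST)
languages = manifest["languages"].split("|")  # the one pipe-separated value
print(scope_levels(manifest), languages)
```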

2. Scope — `{domain}_scope_{lang}.csv`

Atomic manifest of scope atoms (distinct from the manifest file in section 1). One row per (scope, release) combination.

Columns

Column	Type	Required	Description
`scope_id`	integer	yes (PK)	Primary key, sequentially assigned during ingestion
`scope_level_1`	string	yes	Full name of scope level 1 (e.g. register name)
`scope_level_1_alias`	string	no	Optional short alias (e.g. `LISA`)
`scope_level_1_description`	string	no	Free-text description
`scope_level_2`	string	conditional	Second-level name. The column is required when `scope_depth >= 2`; individual rows MAY leave it empty
`scope_level_2_alias`	string	no	Optional short alias for level 2
`scope_level_2_description`	string	no	Free-text description of level 2
…	…	…	Additional `scope_level_N` / `_alias` / `_description` columns for each N from 3 up to `scope_depth`
`release`	string	yes	Atomic release identifier — opaque string
`release_description`	string	no	Description specific to this release
`population_date`	string	no	ISO 8601 date
`measurement_info`	string	no	Free-text description of data collection

Rules

Primary key: scope_id
Uniqueness: (scope_level_1, ..., scope_level_N, release) unique within a domain
Every row MUST have scope_level_1 populated. Deeper levels MAY be empty per-row
All columns are scalar strings or integers. No JSON columns. No slash-delimited compound values

Release axis rules

release is an opaque string — year (2005), academic term (Höstterminen_2020-Vårterminen_2021), calendar date (2014-10-15), quarter (2005_Q1), content-tagged edition, etc.
Each atomic release is one row. No pipe-delimited multi-release strings. Non-contiguous sequences (e.g. election years) get one row per year; gap years are absent.
No range strings like 1990-2009. Use 20 rows, one per atomic year.
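
On the producer side, a legacy range string has to be expanded into atomic releases before it is written to scope.csv. A small sketch of that expansion, assuming plain inclusive year ranges of the form start-end; `expand_year_range` is a hypothetical helper, not part of any pipeline named here.

```python
def expand_year_range(range_string):
    """Expand an inclusive "start-end" year range into atomic release strings."""
    start, end = (int(part) for part in range_string.split("-"))
    return [str(year) for year in range(start, end + 1)]


atoms = expand_year_range("1990-2009")
print(len(atoms), atoms[0], atoms[-1])  # 20 1990 2009
```

Each returned string becomes the release value of its own scope.csv row.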

3. Variables — `{domain}_variables_{lang}.csv`

Central facts table. One row per unique variable-metadata era.

Columns

Column	Type	Required	Description
`variable_name`	string	yes	Column name in the raw data
`variable_label`	string	yes	Human-readable label
`variable_definition`	string	no	Full definition text
`variable_unit`	string	no	Unit of measurement
`variable_type`	enum	yes	`categorical` / `continuous` / `text` / `date` / `identifier`
`variable_description`	string	no	Long-form description
`variable_source`	string	no	Source authority
`variable_external_comment`	string	no	External-facing comment
`datatype`	string	no	Storage type (`char(1)`, `int`, `float`, …)
`value_label_id`	integer	no	FK → `value_labels.csv` (null for non-categorical)
`release_set_id`	integer	yes	FK → `release_sets.csv` (via junction)

Rules

Primary key: (variable_name, datatype, value_label_id, release_set_id)
No scope columns. Scope is resolved via the FK chain: release_set_id → release_sets → scope_id → scope.scope_level_N
Metadata drift handling: when any variable-level attribute changes across releases, the variable splits into multiple rows, each pointing at its own release_set_id covering exactly the releases where that metadata applies
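
The FK chain can be followed in a few lines once the three files are loaded as lists of dicts (for example via a CSV reader, so every value is a string). This is an illustrative sketch with hypothetical sample rows; `releases_for` is not an autolabel function.

```python
def releases_for(variable_row, release_sets, scope_rows, depth=2):
    """Follow release_set_id -> release_sets -> scope_id -> scope rows."""
    scope_by_id = {row["scope_id"]: row for row in scope_rows}
    atoms = [
        scope_by_id[junction["scope_id"]]
        for junction in release_sets
        if junction["release_set_id"] == variable_row["release_set_id"]
    ]
    level_columns = [f"scope_level_{i}" for i in range(1, depth + 1)]
    scopes = {tuple(atom[column] for column in level_columns) for atom in atoms}
    return scopes, sorted(atom["release"] for atom in atoms)


# Hypothetical rows in the shape of the three files.
scope_rows = [
    {"scope_id": "1001", "scope_level_1": "LISA",
     "scope_level_2": "Individer_16_och_äldre", "release": "1990"},
    {"scope_id": "1002", "scope_level_1": "LISA",
     "scope_level_2": "Individer_16_och_äldre", "release": "1991"},
]
release_sets = [
    {"release_set_id": "42", "scope_id": "1001"},
    {"release_set_id": "42", "scope_id": "1002"},
]
variable_row = {"variable_name": "kon", "release_set_id": "42"}

scopes, releases = releases_for(variable_row, release_sets, scope_rows)
print(scopes, releases)
```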

4. Value labels — `{domain}_value_labels_{lang}.csv`

Content-hashed lookup of unique value sets.

Columns

Column	Type	Required	Description
`value_label_id`	integer	yes (PK)	Content-hash-derived primary key
`variable_name`	string	no	Representative variable (informational only)
`value_labels_json`	string	yes	JSON dict `{"1":"Man","2":"Kvinna"}`
`value_labels_stata`	string	yes	Pre-formatted: `"1" "Man" "2" "Kvinna"` — direct input to Stata's `label define`
`code_count`	integer	yes	Number of distinct codes

Two equivalent views

Tools with native JSON parsing (Python pandas, R, DuckDB, JavaScript) MAY parse value_labels_json. Tools without convenient JSON parsing (Stata, shell pipelines, spreadsheet apps) MAY use value_labels_stata. Both encode the same value set.
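
The relationship between the two views can be illustrated by deriving the Stata string from the JSON column. A sketch only: it assumes labels contain no double quotes, and `json_to_stata_view` is a hypothetical helper rather than the producer's actual formatter.

```python
import json


def json_to_stata_view(value_labels_json):
    """Render a {code: label} JSON dict as space-separated quoted pairs."""
    mapping = json.loads(value_labels_json)
    # Assumption: labels contain no double quotes that would need escaping.
    return " ".join(f'"{code}" "{label}"' for code, label in mapping.items())


source = '{"1":"Man","2":"Kvinna"}'
stata_view = json_to_stata_view(source)
code_count = len(json.loads(source))
print(stata_view, code_count)
```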

Content-hashing rule

value_label_id is a stable integer hash over the canonical-language (English) normalized value set. Swedish and English files share the same IDs for conceptually identical sets.

Match by FK, not by name

Look up the variable in variables.csv to get value_label_id, then look up that ID in value_labels.csv. Do NOT match on variable_name (it's informational only).

5. Release sets — `{domain}_release_sets_{lang}.csv`

Junction table. One row per (release_set_id, scope_id) pair.

Columns

Column	Type	Required	Description
`release_set_id`	integer	yes	Content-hash-derived ID for the release set
`scope_id`	integer	yes	FK → `scope.scope_id`

Rules

Primary key: composite (release_set_id, scope_id)
Content-hashing: release_set_id is a stable integer hash over the sorted ascending list of scope_ids in the set. All variables.csv rows that apply to exactly the same set of scope atoms share the same release_set_id
Scope invariant: within a single release_set_id, all referenced scope_ids MUST resolve to the same scope (identical scope_level_* values)
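
The schema does not name the hash function behind release_set_id, so the following only sketches the idea of a content-derived ID. SHA-256, the comma-joined canonical form, and the 8-byte truncation are all assumptions made for illustration; real bundles may derive their IDs differently.

```python
import hashlib


def release_set_id(scope_ids):
    """Derive an order-independent integer ID from a set of scope_ids."""
    # Canonical form: unique scope_ids, sorted ascending, comma-joined.
    canonical = ",".join(str(scope_id) for scope_id in sorted(set(scope_ids)))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Assumption: keep the first 8 bytes as an unsigned integer.
    return int.from_bytes(digest[:8], "big")


same_a = release_set_id([1003, 1001, 1002])
same_b = release_set_id([1001, 1002, 1003])
other = release_set_id([1001, 1002])
print(same_a == same_b, same_a == other)
```

Because the ID depends only on the set's content, two variables.csv rows covering exactly the same scope atoms end up with the same release_set_id regardless of input order.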

Referential integrity

A conformant bundle MUST satisfy:

Every non-null value_label_id in variables.csv exists in value_labels.csv
Every release_set_id in variables.csv has at least one row in release_sets.csv
Every scope_id in release_sets.csv exists in scope.csv
Within each release_set_id, all referenced scope_ids resolve to the same scope (identical scope_level_* values)
Every scope.csv row has scope_level_1 populated; deeper levels may be empty per-row
The manifest file exists for every language listed in its languages key
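
Most of these checks can be run mechanically over the loaded rows. The sketch below covers the first five rules; the per-language manifest check is omitted because it needs file-system access. It assumes rows are dicts of strings, as produced by a CSV reader, and `check_bundle` is a hypothetical name.

```python
def check_bundle(variables, value_labels, release_sets, scope, scope_depth):
    """Return a list of integrity violations (empty when the bundle is clean)."""
    errors = []
    label_ids = {row["value_label_id"] for row in value_labels}
    scope_by_id = {row["scope_id"]: row for row in scope}
    members = {}  # release_set_id -> list of scope_ids
    for row in release_sets:
        members.setdefault(row["release_set_id"], []).append(row["scope_id"])

    for row in variables:
        label_id = row.get("value_label_id", "")
        if label_id and label_id not in label_ids:
            errors.append(f"unknown value_label_id {label_id}")
        if row["release_set_id"] not in members:
            errors.append(f"release_set_id {row['release_set_id']} has no junction rows")

    levels = [f"scope_level_{i}" for i in range(1, scope_depth + 1)]
    for set_id, scope_ids in members.items():
        known = [scope_by_id[s] for s in scope_ids if s in scope_by_id]
        if len(known) < len(scope_ids):
            errors.append(f"release_set_id {set_id} references unknown scope_id")
        if len({tuple(r.get(level, "") for level in levels) for r in known}) > 1:
            errors.append(f"release_set_id {set_id} spans more than one scope")

    for row in scope:
        if not row.get("scope_level_1"):
            errors.append(f"scope_id {row['scope_id']} has empty scope_level_1")
    return errors


# Hypothetical minimal bundle.
scope = [
    {"scope_id": "1001", "scope_level_1": "LISA", "scope_level_2": "A", "release": "1990"},
    {"scope_id": "1002", "scope_level_1": "LISA", "scope_level_2": "A", "release": "1991"},
]
release_sets = [
    {"release_set_id": "42", "scope_id": "1001"},
    {"release_set_id": "42", "scope_id": "1002"},
]
value_labels = [{"value_label_id": "1"}]
variables = [{"variable_name": "kon", "value_label_id": "1", "release_set_id": "42"}]

print(check_bundle(variables, value_labels, release_sets, scope, scope_depth=2))  # []
```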

`schema_version` — wire-format gate

Every autolabel-readable dataset carries a schema_version marker:

Stata: char _dta[schema_version] "2.0"
Python / pandas: df.attrs['schema_version'] = '2.0'
R / haven: attr(df, 'schema_version') <- '2.0'

This is a runtime compatibility check — autolabel reads it to decide whether an installed tool version can parse the bundle. A future breaking change (e.g. required column added) would bump schema_version to 3.0, and the tool-version gate would reject the new bundle until the tool upgrades.
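
The gate could look roughly like the sketch below. How autolabel actually compares versions is not described here; matching on the major number only, and the names `SUPPORTED_MAJOR` and `can_parse`, are assumptions.

```python
SUPPORTED_MAJOR = 2  # assumption: this tool build understands schema 2.x


def can_parse(schema_version, supported_major=SUPPORTED_MAJOR):
    """Accept a bundle only when its major schema version matches the tool's."""
    major = int(schema_version.split(".")[0])
    return major == supported_major


print(can_parse("2.0"), can_parse("3.0"))  # True False
```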

Multilingual convention

Bundles ship one set of 5 files per language:

{domain}_manifest_{lang}.csv
{domain}_scope_{lang}.csv
{domain}_variables_{lang}.csv
{domain}_value_labels_{lang}.csv
{domain}_release_sets_{lang}.csv

English is REQUIRED at the field level. Native-language files are REQUIRED where the source is non-English. Additional languages are OPTIONAL.

IDs are canonical across language files. scope_id, value_label_id, and release_set_id MUST be identical across all language variants. Only localized text columns differ.

Manifest titles are localized per file. Level names (lowercase, machine-readable) are canonical and MUST match across language files; titles (human-readable) differ per language.
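
A producer can verify the canonical-ID rule by comparing the ID sets of two language variants of the same file. A minimal sketch with hypothetical rows; `id_mismatches` is an illustrative helper.

```python
ID_COLUMNS = ("scope_id", "value_label_id", "release_set_id")


def id_mismatches(rows_a, rows_b):
    """Return the ID columns whose value sets differ between two language files."""
    mismatched = []
    for column in ID_COLUMNS:
        ids_a = {row[column] for row in rows_a if column in row}
        ids_b = {row[column] for row in rows_b if column in row}
        if ids_a != ids_b:
            mismatched.append(column)
    return mismatched


# Hypothetical variables rows: only the localized label differs.
variables_swe = [{"variable_name": "kon", "variable_label": "Kön",
                  "value_label_id": "1", "release_set_id": "42"}]
variables_eng = [{"variable_name": "kon", "variable_label": "Gender",
                  "value_label_id": "1", "release_set_id": "42"}]

print(id_mismatches(variables_swe, variables_eng))  # []
```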

Extension namespacing

Producers MAY add extension columns to any data file or extension keys to the manifest. Extension names MUST use a colon-prefix matching the catalog domain:

scb:ext_ansvarig_enhet (column)
scb:ext_internal_notes (manifest key)
dst:ext_hq_flag
healthdcat:ext_access_rights

Consumer tools MUST ignore unrecognized namespaced columns and keys. This enables non-breaking extension without bumping schema_version.
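
On the consumer side, ignoring unknown extensions can be as simple as filtering column names that carry a colon prefix. A sketch under assumptions: `drop_unrecognized_extensions` is a hypothetical helper, and the cell value in the sample row is invented for illustration.

```python
def drop_unrecognized_extensions(row, recognized=()):
    """Keep core columns plus any namespaced columns the consumer knows."""
    return {
        name: value
        for name, value in row.items()
        if ":" not in name or name in recognized
    }


# Hypothetical row with one extension column.
row = {"variable_name": "kon", "scb:ext_ansvarig_enhet": "example unit"}

print(drop_unrecognized_extensions(row))
print(drop_unrecognized_extensions(row, recognized={"scb:ext_ansvarig_enhet"}))
```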

Design rationale

Why CSV-only on-disk with companion columns for structured content

CSV is the lowest common denominator of tabular-data interchange. Every statistical package, every programming language, every data pipeline reads rectangular CSV natively. JSON is convenient in some environments (Python, JavaScript) and inconvenient in others (Stata, shell scripts). The schema commits to CSV throughout to stay readable in any rectangular-data tool.

Where a value naturally has structured form (the value set {code: label} dict), value_labels.csv provides both a JSON column and a companion pre-formatted string column. Redundant on purpose — consumers pick whichever fits.

Why `scope` (not `context`, not `register`) — and why separate columns per level

"Register" is SCB-inflected. "Context" is overloaded in the LLM era. "Scope" is unambiguous, matches programming semantics (variable scope = the range where a declaration applies), and reads naturally in filter syntax.

Scope levels are stored as separate columns rather than a single slash-separated string because:

No delimiter collision. Agency names can contain slashes (e.g. Hälso-/sjukvårdsregistret). Separate columns eliminate sanitization.
Each level is independently queryable. keep if scope_level_1 == "LISA" works without substring parsing.
Native argument passing. autolabel variables, scope("LISA" "Individer 16 år och äldre") — users pass levels as separate quoted strings.
Each level can have its own alias. scope_level_1_alias enables short-code lookups without conflating alias semantics with the hierarchical path.

Why `release` (not `version`, not `period`)

"Version" implies semver. "Period" implies continuous time intervals — but SCB's six observed release patterns include calendar dates, content-tagged editions, and future pre-announced events. "Release" is the universal data-publishing term with no temporal presumption.

Why scope and release are separate axes

Scope is identity. kon in LISA and kon in STATIV are different conceptual variables — different labels, different value sets, different meaning.

Release is temporal. kon in LISA 2005 and kon in LISA 2006 are the same conceptual variable across editions.

This asymmetry is what enables release_sets.csv to compress variable rows. For SCB, the junction-table design compresses what would otherwise be ~521,000 (variable × atom) flat rows into ~80,230 variables.csv rows + ~81,693 junction rows.

Why integer foreign keys throughout

No text duplication. Referential integrity enforceable. Integer-join performance.

Why dynamic scope depth

Fixed 2-level scope would lock the format to a Nordic register-data taxonomy. Future agencies with different hierarchies (3-level, 5-level, or differently-named 2-level) would require a breaking schema bump. Dynamic depth declared in the manifest is agency-neutral and future-proof.

Worked example — `kon` in LISA

kon (gender) is stable across LISA 1990–2009 with value set {"1":"Man","2":"Kvinna"}.

`scb_scope_eng.csv` (20 atomic rows, one per year)

scope_id;scope_level_1;scope_level_1_alias;scope_level_2;release;…
1001;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"1990";…
1002;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"1991";…
…
1020;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"2009";…

`scb_variables_eng.csv` (1 row — fully compressed via both FKs)

variable_name;variable_label;…;datatype;value_label_id;release_set_id
kon;Gender;…;char(1);1;42

`scb_value_labels_eng.csv` (1 row with both JSON and Stata columns)

value_label_id;variable_name;value_labels_json;value_labels_stata;code_count
1;kon;"{""1"":""Man"",""2"":""Kvinna""}";"""1"" ""Man"" ""2"" ""Kvinna""";2

`scb_release_sets_eng.csv` (20 junction rows for `release_set_id=42`)

release_set_id;scope_id
42;1001
42;1002
42;1003
…
42;1020

Stata query for "kon in LISA 2005"

import delimited using "scb_scope_eng.csv", clear delimiters(";")
keep if scope_level_1_alias == "LISA" & release == "2005"
levelsof scope_id, local(rv)   // gets 1016 (1990 is 1001, so 2005 is 1001 + 15)

import delimited using "scb_release_sets_eng.csv", clear delimiters(";")
keep if scope_id == `rv'
levelsof release_set_id, local(rs)   // gets 42

import delimited using "scb_variables_eng.csv", clear delimiters(";")
keep if variable_name == "kon" & release_set_id == `rs'
* apply label from value_labels_stata column

No JSON parsing anywhere.
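
For comparison, the same lookup can be sketched in Python over in-memory rows. Unlike the step-by-step query above, this version keeps sets of IDs at each step, so it also works when a scope atom belongs to several release sets. The rows mirror the worked example but are reduced to the columns the lookup needs; `lookup` is a hypothetical function.

```python
import json

# Rows shaped like the worked example: 20 LISA atoms, one per year 1990..2009.
scope = [
    {"scope_id": 1001 + i, "scope_level_1_alias": "LISA", "release": str(1990 + i)}
    for i in range(20)
]
release_sets = [{"release_set_id": 42, "scope_id": atom["scope_id"]} for atom in scope]
variables = [{"variable_name": "kon", "variable_label": "Gender",
              "value_label_id": 1, "release_set_id": 42}]
value_labels = {1: '{"1":"Man","2":"Kvinna"}'}


def lookup(name, alias, release):
    """Resolve a variable's label and value set for one scope alias and release."""
    atom_ids = {
        atom["scope_id"] for atom in scope
        if atom["scope_level_1_alias"] == alias and atom["release"] == release
    }
    set_ids = {
        junction["release_set_id"] for junction in release_sets
        if junction["scope_id"] in atom_ids
    }
    return [
        (row["variable_label"], json.loads(value_labels[row["value_label_id"]]))
        for row in variables
        if row["variable_name"] == name and row["release_set_id"] in set_ids
    ]


result = lookup("kon", "LISA", "2005")
print(result)
```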

Migration from the pre-finalized format

Older bundles (3 files: variables + value_labels + registers with inline register/variant/versions columns and pipe-delimited versions) are obsolete. Schema v2 replaces them with the 5-file normalized layout described here.

Pre-finalized layout	Schema v2
3 files per language	5 files per language (adds manifest + release_sets)
`register_id` FK on variables	`release_set_id` FK (junction table)
Pipe-delimited `versions` column	Atomic rows in scope; junction table in release_sets
`register` + `variant` columns	`scope_level_N` separate columns (dynamic depth)
`version` column	`release` opaque string
No manifest	Manifest CSV per language per domain

The RegiStream catalog pipeline (DuckDB) runs a one-pass migration into the v2 layout.

Institutional metadata

The schema is agency-neutral. Any institution can run autolabel against its own register or survey data by placing a schema v2 bundle at ~/.registream/autolabel/<yourdomain>/ and calling autolabel variables, domain(yourdomain) lang(eng). The minimum viable bundle is the two core files: variables.csv and value_labels.csv. The three augmentation files (scope.csv, release_sets.csv, manifest.csv) add per-scope and per-release precision when you need it.

No internet access required. No registration, no coordination with registream.org — the domain name is local to the institution. Private domains sit side-by-side with the public ones on disk; the tool does not distinguish.

For the full walkthrough — directory layout, distribution patterns (shared network path, git repo, sysadmin deployment), hybrid public+private setups — see Private domains & institutional setup.
