Autolabel schema v2

The data format autolabel reads. Five normalized CSV files per catalog domain, with two hierarchical axes (scope & release) and integer foreign keys throughout.

What the schema is

An autolabel-compatible bundle contains 5 CSV files per language for each catalog domain (e.g. scb, dst, ssb):

File                              Role
{domain}_manifest_{lang}.csv      Manifest (config): schema version, scope hierarchy names, publisher info
{domain}_scope_{lang}.csv         Atomic manifest of (scope, release) instances
{domain}_variables_{lang}.csv     Central facts table: one row per stable variable-metadata era, two integer FKs
{domain}_value_labels_{lang}.csv  Content-hashed value sets, both JSON and Stata-native format columns
{domain}_release_sets_{lang}.csv  Junction table linking release sets to scope atoms

All data files are pure rectangular CSVs with scalar columns. value_labels.csv carries two equivalent views of each value set — a JSON column and a pre-formatted Stata string — so consumers pick whichever fits their tool. All cross-file references are integer foreign keys. No inline text duplication. No pipe-delimited multi-value cells.

CSV conventions

  • Delimiter: semicolon (;) — avoids quoting issues with JSON content and Scandinavian text
  • Quoting: double-quote (") with inner quotes doubled ("")
  • Encoding: UTF-8
  • No pipe-separated multi-value cells in data files; every cell is scalar (the languages key in the manifest, a config file, is the pipe-separated exception)
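
As a rough illustration of these conventions, the sketch below parses one value_labels row with Python's standard csv module. The sample text is the kon row used elsewhere on this page; the helper name read_rows is made up for the example.

```python
import csv
import io
import json

# Header plus the kon row of a value_labels file, as it would appear on disk.
RAW = (
    'value_label_id;variable_name;value_labels_json;value_labels_stata;code_count\n'
    '1;kon;"{""1"":""Man"",""2"":""Kvinna""}";"""1"" ""Man"" ""2"" ""Kvinna""";2\n'
)

def read_rows(text):
    """Parse semicolon-delimited CSV text whose inner quotes are doubled."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter=";", quotechar='"', doublequote=True
    )
    return list(reader)

rows = read_rows(RAW)
labels = json.loads(rows[0]["value_labels_json"])
print(labels)                         # {'1': 'Man', '2': 'Kvinna'}
print(rows[0]["value_labels_stata"])  # "1" "Man" "2" "Kvinna"
```

When reading from disk rather than a string, open the file with encoding="utf-8" and newline="" so the csv module handles line endings itself.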

Two hierarchical axes

Every variable's identity is defined by two axes:

  • scope — what slice of reality the variable describes. Stored as separate columns scope_level_1, scope_level_2, etc. (e.g. scope_level_1 = LISA, scope_level_2 = Individer_16_och_äldre). Depth declared per domain in the manifest.
  • release — which published release instance. Opaque string (e.g. 2005, Höstterminen_2020, 2019_lone_formansutbetalningar).

Scope depth is dynamic — declared per domain in the manifest. SCB uses 2 levels ("Register" / "Variant"); SSB uses 2 differently-named levels ("Source" / "Group"); a future agency could use 4. autolabel reads the manifest at runtime to drive UI labels and browse hierarchy — no code changes needed.

Core vs. augmentation files

The five files split into two tiers:

  • Core (required): variables.csv + value_labels.csv. A consumer can label every variable from these two files alone, using a majority-label rule.
  • Augmentation (recommended): scope.csv + release_sets.csv + manifest.csv. The precision layer — enables scope/release filtering and browse hierarchy.

A conformant consumer MUST work with just the core files (graceful degradation). A conformant producer SHOULD emit all five for full precision.
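
This page does not define the majority-label rule in detail. One plausible reading, sketched below purely as an assumption, is that a core-only consumer picks for each variable_name the (variable_label, value_label_id) pair that appears on the most variables.csv rows. The function name and the second label ("Sex", value_label_id 7) are hypothetical.

```python
from collections import Counter, defaultdict

def majority_labels(variable_rows):
    """Per variable_name, pick the (label, value_label_id) pair seen on most rows."""
    counts = defaultdict(Counter)
    for row in variable_rows:
        pair = (row["variable_label"], row["value_label_id"])
        counts[row["variable_name"]][pair] += 1
    # most_common(1) resolves ties in favour of the pair encountered first.
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}

rows = [
    {"variable_name": "kon", "variable_label": "Gender", "value_label_id": "1"},
    {"variable_name": "kon", "variable_label": "Gender", "value_label_id": "1"},
    {"variable_name": "kon", "variable_label": "Sex", "value_label_id": "7"},  # hypothetical era
]
print(majority_labels(rows))  # {'kon': ('Gender', '1')}
```

A real consumer might weight rows differently; without the augmentation files, row counts are the only signal available.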

1. Manifest — {domain}_manifest_{lang}.csv

Flat key-value CSV declaring domain metadata and scope hierarchy semantics. One file per language (titles are localized).

Columns

Column  Type    Required  Description
key     string  yes       Configuration key
value   string  yes       Value for the key

Required keys

  • domain — catalog domain identifier (e.g. scb)
  • schema_version — wire-format version (currently 2.0)
  • publisher — publishing organization name
  • bundle_release_date — ISO 8601 date
  • languages — pipe-separated list (e.g. swe|eng)
  • scope_depth — integer N, the hierarchy depth for this domain
  • scope_level_1_name … scope_level_N_name — short machine-readable level name, one key per level
  • scope_level_1_title … scope_level_N_title — human-readable level title in this file's language, one key per level

Example — SCB English manifest

key;value
domain;scb
schema_version;2.0
publisher;Statistics Sweden (SCB)
bundle_release_date;2026-04-16
languages;swe|eng
scope_depth;2
scope_level_1_name;register
scope_level_1_title;Register
scope_level_2_name;variant
scope_level_2_title;Variant

Example — SSB English manifest (different naming)

key;value
domain;ssb
schema_version;2.0
publisher;Statistics Norway (SSB)
bundle_release_date;2026-04-20
languages;nob|eng
scope_depth;2
scope_level_1_name;source
scope_level_1_title;Source
scope_level_2_name;group
scope_level_2_title;Group
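
A consumer could turn a manifest into a lookup table and derive the scope hierarchy from scope_depth roughly as follows. This is a minimal sketch using the SCB example above; load_manifest and scope_levels are illustrative names, not part of autolabel.

```python
import csv
import io

MANIFEST = """key;value
domain;scb
schema_version;2.0
publisher;Statistics Sweden (SCB)
bundle_release_date;2026-04-16
languages;swe|eng
scope_depth;2
scope_level_1_name;register
scope_level_1_title;Register
scope_level_2_name;variant
scope_level_2_title;Variant
"""

def load_manifest(text):
    """Read the key;value manifest into a plain dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["key"]: row["value"] for row in reader}

def scope_levels(manifest):
    """Return [(name, title), ...] for levels 1..scope_depth."""
    depth = int(manifest["scope_depth"])
    return [
        (manifest[f"scope_level_{n}_name"], manifest[f"scope_level_{n}_title"])
        for n in range(1, depth + 1)
    ]

manifest = load_manifest(MANIFEST)
levels = scope_levels(manifest)
languages = manifest["languages"].split("|")
print(levels)     # [('register', 'Register'), ('variant', 'Variant')]
print(languages)  # ['swe', 'eng']
```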

2. Scope — {domain}_scope_{lang}.csv

Atomic manifest. One row per (scope, release) combination.

Columns

Column                     Type     Required     Description
scope_id                   integer  yes (PK)     Primary key, sequentially assigned during ingestion
scope_level_1              string   yes          Full name of scope level 1 (e.g. register name)
scope_level_1_alias        string   no           Optional short alias (e.g. LISA)
scope_level_1_description  string   no           Free-text description
scope_level_2              string   conditional  Second-level name. Column required when scope_depth >= 2; individual rows may leave it empty
scope_level_2_alias        string   no           Optional short alias for level 2
scope_level_2_description  string   no           Free-text description of level 2
Additional scope_level_N / _alias / _description columns follow the same pattern for each further level N up to scope_depth
release                    string   yes          Atomic release identifier (opaque string)
release_description        string   no           Description specific to this release
population_date            string   no           ISO 8601 date
measurement_info           string   no           Free-text description of data collection

Rules

  • Primary key: scope_id
  • Uniqueness: (scope_level_1, ..., scope_level_N, release) unique within a domain
  • Every row MUST have scope_level_1 populated. Deeper levels MAY be empty per-row
  • All columns are scalar strings or integers. No JSON columns. No slash-delimited compound values

Release axis rules

  • release is an opaque string — year (2005), academic term (Höstterminen_2020-Vårterminen_2021), calendar date (2014-10-15), quarter (2005_Q1), content-tagged edition, etc.
  • Each atomic release is one row. No pipe-delimited multi-release strings. Non-contiguous sequences (e.g. election years) get one row per year; gap years are absent.
  • No range strings like 1990-2009. Use 20 rows, one per atomic year.

3. Variables — {domain}_variables_{lang}.csv

Central facts table. One row per unique variable-metadata era.

Columns

Column                     Type     Required  Description
variable_name              string   yes       Column name in the raw data
variable_label             string   yes       Human-readable label
variable_definition        string   no        Full definition text
variable_unit              string   no        Unit of measurement
variable_type              enum     yes       categorical / continuous / text / date / identifier
variable_description       string   no        Long-form description
variable_source            string   no        Source authority
variable_external_comment  string   no        External-facing comment
datatype                   string   no        Storage type (char(1), int, float, …)
value_label_id             integer  no        FK → value_labels.csv (null for non-categorical)
release_set_id             integer  yes       FK → release_sets.csv (via junction)

Rules

  • Primary key: (variable_name, datatype, value_label_id, release_set_id)
  • No scope columns. Scope is resolved via the FK chain: release_set_id → release_sets → scope_id → scope.scope_level_N
  • Metadata drift handling: when any variable-level attribute changes across releases, the variable splits into multiple rows, each pointing at its own release_set_id covering exactly the releases where that metadata applies
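
The FK chain in the rules above can be sketched with small in-memory stand-ins for release_sets.csv and scope.csv. The rows are modeled on a two-release LISA example; the alias column stands in for the full scope_level_1 name, and resolve is an illustrative helper, not an autolabel function.

```python
# In-memory stand-ins for release_sets.csv and scope.csv (two atoms only).
release_sets = [
    {"release_set_id": 42, "scope_id": 1001},
    {"release_set_id": 42, "scope_id": 1002},
]
scope = {
    1001: {"scope_level_1_alias": "LISA",
           "scope_level_2": "Individer 16 år och äldre", "release": "1990"},
    1002: {"scope_level_1_alias": "LISA",
           "scope_level_2": "Individer 16 år och äldre", "release": "1991"},
}

def resolve(variable_row, release_sets, scope):
    """Follow release_set_id -> scope_id -> scope row for one variables.csv row."""
    atoms = [
        scope[j["scope_id"]]
        for j in release_sets
        if j["release_set_id"] == variable_row["release_set_id"]
    ]
    scopes = {(a["scope_level_1_alias"], a["scope_level_2"]) for a in atoms}
    releases = sorted(a["release"] for a in atoms)
    return scopes, releases

scopes, releases = resolve(
    {"variable_name": "kon", "release_set_id": 42}, release_sets, scope
)
print(scopes)    # one scope shared by every atom in the set
print(releases)  # ['1990', '1991']
```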

4. Value labels — {domain}_value_labels_{lang}.csv

Content-hashed lookup of unique value sets.

Columns

Column              Type     Required  Description
value_label_id      integer  yes (PK)  Content-hash-derived primary key
variable_name       string   no        Representative variable (informational only)
value_labels_json   string   yes       JSON dict, e.g. {"1":"Man","2":"Kvinna"}
value_labels_stata  string   yes       Pre-formatted string, e.g. "1" "Man" "2" "Kvinna"; direct input to Stata's label define
code_count          integer  yes       Number of distinct codes

Two equivalent views

Tools with native JSON parsing (Python pandas, R, DuckDB, JavaScript) MAY parse value_labels_json. Tools without convenient JSON parsing (Stata, shell pipelines, spreadsheet apps) MAY use value_labels_stata. Both encode the same value set.
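
To make the equivalence concrete, the sketch below derives the Stata-style string from the JSON column. It assumes labels contain no double quotes, since this page does not describe how such characters would be escaped; json_to_stata is a hypothetical helper.

```python
import json

def json_to_stata(value_labels_json):
    """Render a JSON code->label dict as space-separated quoted pairs."""
    mapping = json.loads(value_labels_json)  # key order follows the JSON text
    return " ".join(f'"{code}" "{label}"' for code, label in mapping.items())

source = '{"1":"Man","2":"Kvinna"}'
stata = json_to_stata(source)
code_count = len(json.loads(source))
print(stata)       # "1" "Man" "2" "Kvinna"
print(code_count)  # 2
```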

Content-hashing rule

value_label_id is a stable integer hash over the canonical-language (English) normalized value set. Swedish and English files share the same IDs for conceptually identical sets.

Match by FK, not by name

Look up the variable in variables.csv to get value_label_id, then look up that ID in value_labels.csv. Do NOT match on variable_name (it's informational only).
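
The sketch below shows why the lookup goes through the FK. Two variables share one value set, but only one of them is named in the value_labels row. The variables sex and income are hypothetical, and representing a null FK as an empty cell is an assumption.

```python
import json

# value_labels rows keyed by value_label_id; variable_name is informational only.
value_labels = {
    "1": {"variable_name": "kon", "value_labels_json": '{"1":"Man","2":"Kvinna"}'},
}
variables = [
    {"variable_name": "kon", "value_label_id": "1"},
    {"variable_name": "sex", "value_label_id": "1"},    # hypothetical, shares the set
    {"variable_name": "income", "value_label_id": ""},  # hypothetical, non-categorical
]

def labels_for(variable_row, value_labels):
    """Follow value_label_id; return None when the FK is null (empty cell)."""
    fk = variable_row["value_label_id"]
    if fk == "":
        return None
    return json.loads(value_labels[fk]["value_labels_json"])

resolved = {v["variable_name"]: labels_for(v, value_labels) for v in variables}
print(resolved["sex"])     # {'1': 'Man', '2': 'Kvinna'}
print(resolved["income"])  # None
```

Matching on variable_name would have found labels for kon only and missed sex entirely.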

5. Release sets — {domain}_release_sets_{lang}.csv

Junction table. One row per (release_set_id, scope_id) pair.

Columns

Column          Type     Required  Description
release_set_id  integer  yes       Content-hash-derived ID for the release set
scope_id        integer  yes       FK → scope.scope_id

Rules

  • Primary key: composite (release_set_id, scope_id)
  • Content-hashing: release_set_id is a stable integer hash over the sorted ascending list of scope_ids in the set. All variables.csv rows that apply to exactly the same set of scope atoms share the same release_set_id
  • Scope invariant: within a single release_set_id, all referenced scope_ids MUST resolve to the same scope (identical scope_level_* values)
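
The page does not name the hash function behind release_set_id. The sketch below assumes SHA-256 over the comma-joined sorted IDs, truncated to 63 bits so the result fits a signed 64-bit integer. It only illustrates the order-independence that the content-hashing rule requires.

```python
import hashlib

def release_set_id(scope_ids):
    """Stable integer ID for a set of scope_ids (assumed hash: truncated SHA-256)."""
    canonical = ",".join(str(i) for i in sorted(set(scope_ids)))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & (2**63 - 1)

a = release_set_id([1002, 1001, 1003])
b = release_set_id([1001, 1002, 1003])
c = release_set_id([1001, 1002])
print(a == b)  # True: input order does not matter
print(a == c)  # False: a different set of atoms gets a different ID
```

The small IDs in this page's examples (such as 42) are illustrative; IDs produced by this sketch would be large integers.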

Referential integrity

A conformant bundle MUST satisfy:

  1. Every non-null value_label_id in variables.csv exists in value_labels.csv
  2. Every release_set_id in variables.csv has at least one row in release_sets.csv
  3. Every scope_id in release_sets.csv exists in scope.csv
  4. Within each release_set_id, all referenced scope_ids resolve to the same scope (identical scope_level_* values)
  5. Every scope.csv row has scope_level_1 populated; deeper levels may be empty per-row
  6. The manifest file exists for every language listed in its languages key
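
A producer could check rules 1 to 5 before publishing with something like the following sketch. It assumes rows are already loaded as dicts with integer IDs and None for a null value_label_id. Rule 6 is a file-system check and is left out. check_bundle is an illustrative name.

```python
def check_bundle(variables, value_labels, release_sets, scope, scope_depth):
    """Return a list of messages, one per violation of rules 1-5."""
    errors = []
    label_ids = {r["value_label_id"] for r in value_labels}
    scope_by_id = {r["scope_id"]: r for r in scope}

    # Group junction rows: release_set_id -> list of scope_ids.
    members = {}
    for r in release_sets:
        members.setdefault(r["release_set_id"], []).append(r["scope_id"])

    for v in variables:
        vid = v.get("value_label_id")
        if vid is not None and vid not in label_ids:            # rule 1
            errors.append(f"rule 1: value_label_id {vid} not in value_labels")
        if v["release_set_id"] not in members:                  # rule 2
            errors.append(f"rule 2: release_set_id {v['release_set_id']} has no junction rows")

    for rs_id, scope_ids in members.items():
        known = [s for s in scope_ids if s in scope_by_id]
        if len(known) < len(scope_ids):                         # rule 3
            errors.append(f"rule 3: release set {rs_id} references an unknown scope_id")
        paths = {
            tuple(scope_by_id[s].get(f"scope_level_{n}", "") for n in range(1, scope_depth + 1))
            for s in known
        }
        if len(paths) > 1:                                      # rule 4
            errors.append(f"rule 4: release set {rs_id} spans {len(paths)} scopes")

    for row in scope:
        if not row.get("scope_level_1"):                        # rule 5
            errors.append(f"rule 5: scope_id {row['scope_id']} lacks scope_level_1")
    return errors


scope = [
    {"scope_id": 1001, "scope_level_1": "LISA",
     "scope_level_2": "Individer 16 år och äldre", "release": "1990"},
    {"scope_id": 1002, "scope_level_1": "LISA",
     "scope_level_2": "Individer 16 år och äldre", "release": "1991"},
]
release_sets = [{"release_set_id": 42, "scope_id": 1001},
                {"release_set_id": 42, "scope_id": 1002}]
value_labels = [{"value_label_id": 1}]
good = [{"variable_name": "kon", "value_label_id": 1, "release_set_id": 42}]
bad = [{"variable_name": "kon", "value_label_id": 99, "release_set_id": 7}]  # both FKs dangle

print(check_bundle(good, value_labels, release_sets, scope, scope_depth=2))      # []
print(len(check_bundle(bad, value_labels, release_sets, scope, scope_depth=2)))  # 2
```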

schema_version — wire-format gate

Every autolabel-readable dataset carries a schema_version marker:

  • Stata: char _dta[schema_version] "2.0"
  • Python / pandas: df.attrs['schema_version'] = '2.0'
  • R / haven: attr(df, 'schema_version') <- '2.0'

This is a runtime compatibility check — autolabel reads it to decide whether an installed tool version can parse the bundle. A future breaking change (e.g. required column added) would bump schema_version to 3.0, and the tool-version gate would reject the new bundle until the tool upgrades.
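
A gate of this kind might look like the sketch below. It assumes compatibility is decided by the major version number alone, which matches the 2.0 to 3.0 example above but is not spelled out on this page. SUPPORTED_MAJOR and can_read are hypothetical names.

```python
SUPPORTED_MAJOR = 2  # assumption: this hypothetical tool build understands schema 2.x

def can_read(schema_version, supported_major=SUPPORTED_MAJOR):
    """Accept a bundle only when its major schema version matches the tool's."""
    major = int(str(schema_version).split(".")[0])
    return major == supported_major

for version in ("2.0", "3.0"):
    print(version, "accepted" if can_read(version) else "rejected")
# 2.0 accepted
# 3.0 rejected
```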

Multilingual convention

Bundles ship one set of 5 files per language:

{domain}_manifest_{lang}.csv
{domain}_scope_{lang}.csv
{domain}_variables_{lang}.csv
{domain}_value_labels_{lang}.csv
{domain}_release_sets_{lang}.csv

English is REQUIRED at the field level. Native-language files are REQUIRED where the source is non-English. Additional languages are OPTIONAL.

IDs are canonical across language files. scope_id, value_label_id, and release_set_id MUST be identical across all language variants. Only localized text columns differ.

Manifest titles are localized per file. Level names (lowercase, machine-readable) are canonical and MUST match across language files; titles (human-readable) differ per language.
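
One way to test the cross-language invariant is to compare the ID-bearing columns of two language variants while ignoring localized text. The sketch below does this for a single variables row; the Swedish label "Kön" and the helper key_set are illustrative.

```python
KEY_COLUMNS = ("variable_name", "value_label_id", "release_set_id")

def key_set(rows, key_columns=KEY_COLUMNS):
    """Collect the language-independent part of each row."""
    return {tuple(row[c] for c in key_columns) for row in rows}

eng = [{"variable_name": "kon", "variable_label": "Gender",
        "value_label_id": "1", "release_set_id": "42"}]
swe = [{"variable_name": "kon", "variable_label": "Kön",
        "value_label_id": "1", "release_set_id": "42"}]

print(key_set(eng) == key_set(swe))  # True: only localized text differs
```

The same comparison applies to scope_id in the scope files and to the scope_level_N_name keys in the manifests.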

Extension namespacing

Producers MAY add extension columns to any data file or extension keys to the manifest. Extension names MUST use a colon-prefix matching the catalog domain:

  • scb:ext_ansvarig_enhet (column)
  • scb:ext_internal_notes (manifest key)
  • dst:ext_hq_flag
  • healthdcat:ext_access_rights

Consumer tools MUST ignore unrecognized namespaced columns and keys. This enables non-breaking extension without bumping schema_version.
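
On the consumer side, ignoring extensions can be as simple as dropping every column whose name contains a colon. This is a minimal sketch; the cell value "Enhet A" is made up, and strip_extensions is a hypothetical helper.

```python
def strip_extensions(row):
    """Drop namespaced extension columns (names containing a colon)."""
    return {name: value for name, value in row.items() if ":" not in name}

row = {
    "variable_name": "kon",
    "variable_label": "Gender",
    "scb:ext_ansvarig_enhet": "Enhet A",  # made-up extension value
}
print(strip_extensions(row))  # {'variable_name': 'kon', 'variable_label': 'Gender'}
```

A consumer that does recognize a given namespace can of course keep those columns instead of dropping them.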

Design rationale

Why CSV-only on-disk with companion columns for structured content

CSV is the lowest common denominator of tabular-data interchange. Every statistical package, every programming language, every data pipeline reads rectangular CSV natively. JSON is convenient in some environments (Python, JavaScript) and inconvenient in others (Stata, shell scripts). The schema commits to CSV throughout to stay readable in any rectangular-data tool.

Where a value naturally has structured form (the value set {code: label} dict), value_labels.csv provides both a JSON column and a companion pre-formatted string column. Redundant on purpose — consumers pick whichever fits.

Why scope (not context, not register) — and why separate columns per level

"Register" is SCB-inflected. "Context" is overloaded in the LLM era. "Scope" is unambiguous, matches programming semantics (variable scope = the range where a declaration applies), and reads naturally in filter syntax.

Scope levels are stored as separate columns rather than a single slash-separated string because:

  1. No delimiter collision. Agency names can contain slashes (e.g. Hälso-/sjukvårdsregistret). Separate columns eliminate sanitization.
  2. Each level is independently queryable. keep if scope_level_1_alias == "LISA" works without substring parsing.
  3. Native argument passing. autolabel variables, scope("LISA" "Individer 16 år och äldre") — users pass levels as separate quoted strings.
  4. Each level can have its own alias. scope_level_1_alias enables short-code lookups without conflating alias semantics with the hierarchical path.

Why release (not version, not period)

"Version" implies semver. "Period" implies continuous time intervals — but SCB's six observed release patterns include calendar dates, content-tagged editions, and future pre-announced events. "Release" is the universal data-publishing term with no temporal presumption.

Why scope and release are separate axes

Scope is identity. kon in LISA and kon in STATIV are different conceptual variables — different labels, different value sets, different meaning.

Release is temporal. kon in LISA 2005 and kon in LISA 2006 are the same conceptual variable across editions.

This asymmetry is what enables release_sets.csv to compress variable rows. For SCB, the junction-table design compresses what would otherwise be ~521,000 (variable × atom) flat rows into ~80,230 variables.csv rows + ~81,693 junction rows.
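
The arithmetic behind that claim can be checked with a short sketch. It assumes the idealized case where every variable is metadata-stable and all variables share one release set; the count of 50 variables is hypothetical. The SCB totals are the approximate figures quoted above.

```python
def flat_rows(n_variables, n_atoms):
    """Rows needed when every (variable, atom) pair is written out."""
    return n_variables * n_atoms

def normalized_rows(n_variables, n_atoms):
    """One variables.csv row per variable plus one junction row per atom."""
    return n_variables + n_atoms

# A single variable over 20 atoms gains nothing from the junction table ...
print(flat_rows(1, 20), normalized_rows(1, 20))    # 20 21
# ... but 50 variables sharing the same release set do.
print(flat_rows(50, 20), normalized_rows(50, 20))  # 1000 70

# SCB totals quoted above (approximate).
scb_flat = 521_000
scb_normalized = 80_230 + 81_693
print(scb_normalized, round(scb_flat / scb_normalized, 1))  # 161923 3.2
```

The saving comes from sharing: the more variables point at the same release_set_id, the further the junction rows are amortized.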

Why integer foreign keys throughout

Integer keys avoid duplicating text across files, make referential integrity mechanically enforceable, and keep joins fast.

Why dynamic scope depth

Fixed 2-level scope would lock the format to a Nordic register-data taxonomy. Future agencies with different hierarchies (3-level, 5-level, or differently-named 2-level) would require a breaking schema bump. Dynamic depth declared in the manifest is agency-neutral and future-proof.

Worked example — kon in LISA

kon (gender) is stable across LISA 1990–2009 with value set {"1":"Man","2":"Kvinna"}.

scb_scope_eng.csv (20 atomic rows, one per year)

scope_id;scope_level_1;scope_level_1_alias;scope_level_2;release;…
1001;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"1990";…
1002;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"1991";…
…
1020;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"2009";…

scb_variables_eng.csv (1 row — fully compressed via both FKs)

variable_name;variable_label;…;datatype;value_label_id;release_set_id
kon;Gender;…;char(1);1;42

scb_value_labels_eng.csv (1 row with both JSON and Stata columns)

value_label_id;variable_name;value_labels_json;value_labels_stata;code_count
1;kon;"{""1"":""Man"",""2"":""Kvinna""}";"""1"" ""Man"" ""2"" ""Kvinna""";2

scb_release_sets_eng.csv (20 junction rows for release_set_id=42)

release_set_id;scope_id
42;1001
42;1002
42;1003
…
42;1020

Stata query for "kon in LISA 2005"

import delimited using "scb_scope_eng.csv", clear delimiters(";")
keep if scope_level_1_alias == "LISA" & release == "2005"
levelsof scope_id, local(rv)   // gets 1016 (1001 is 1990, so 2005 is 1001 + 15)

import delimited using "scb_release_sets_eng.csv", clear delimiters(";")
keep if scope_id == `rv'
levelsof release_set_id, local(rs)   // gets 42

import delimited using "scb_variables_eng.csv", clear delimiters(";")
keep if variable_name == "kon" & release_set_id == `rs'
* apply label from value_labels_stata column

No JSON parsing anywhere.

Migration from the pre-finalized format

Older bundles (3 files: variables + value_labels + registers with inline register/variant/versions columns and pipe-delimited versions) are obsolete. Schema v2 replaces them with the 5-file normalized layout described here.

Pre-finalized layout            Schema v2
3 files per language            5 files per language (adds manifest + release_sets)
register_id FK on variables     release_set_id FK (junction table)
Pipe-delimited versions column  Atomic rows in scope; junction table in release_sets
register + variant columns      scope_level_N separate columns (dynamic depth)
version column                  release opaque string
No manifest                     Manifest CSV per language per domain

The RegiStream catalog pipeline (DuckDB) runs a one-pass migration into the v2 layout.
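
The core of such a migration is exploding each pipe-delimited versions cell into atomic scope rows. The sketch below shows that one step in plain Python rather than DuckDB. The old-format column names are taken from the table above, while the function name and starting ID are assumptions and this is not the pipeline's actual code.

```python
def explode_versions(old_rows, first_scope_id=1):
    """Turn register/variant/versions rows into one scope row per atomic release."""
    scope_rows = []
    next_id = first_scope_id
    for old in old_rows:
        for release in old["versions"].split("|"):
            scope_rows.append({
                "scope_id": next_id,  # sequentially assigned during ingestion
                "scope_level_1": old["register"],
                "scope_level_2": old["variant"],
                "release": release,
            })
            next_id += 1
    return scope_rows

old = [{"register": "LISA", "variant": "Individer 16 år och äldre",
        "versions": "1990|1991|1992"}]
for row in explode_versions(old):
    print(row["scope_id"], row["release"])
# 1 1990
# 2 1991
# 3 1992
```

A full migration would also build the release_sets junction rows and rewrite the FK on variables, which this sketch leaves out.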

Institutional metadata

The schema is agency-neutral. Any institution can run autolabel against its own register or survey data by placing a schema v2 bundle at ~/.registream/autolabel/<yourdomain>/ and calling autolabel variables, domain(yourdomain) lang(eng). Minimum viable is the two core files — variables.csv and value_labels.csv. The three augmentation files (scope.csv, release_sets.csv, manifest.csv) enable per-scope and per-release precision when you need it.

No internet access required. No registration, no coordination with registream.org — the domain name is local to the institution. Private domains sit side-by-side with the public ones on disk; the tool does not distinguish.

For the full walkthrough — directory layout, distribution patterns (shared network path, git repo, sysadmin deployment), hybrid public+private setups — see Private domains & institutional setup.

Further reading