Autolabel schema v2
The data format autolabel reads. Five normalized CSV files per catalog domain, with two hierarchical axes (scope & release) and integer foreign keys throughout.
What the schema is
An autolabel-compatible bundle contains 5 CSV files per language for each catalog domain (e.g. scb, dst, ssb):
| File | Role |
|---|---|
| `{domain}_manifest_{lang}.csv` | Manifest (config): schema version, scope hierarchy names, publisher info |
| `{domain}_scope_{lang}.csv` | Atomic manifest of (scope, release) instances |
| `{domain}_variables_{lang}.csv` | Central facts table: one row per stable variable-metadata era, two integer FKs |
| `{domain}_value_labels_{lang}.csv` | Content-hashed value sets, both JSON and Stata-native format columns |
| `{domain}_release_sets_{lang}.csv` | Junction table linking release sets to scope atoms |
All data files are pure rectangular CSVs with scalar columns. value_labels.csv carries two equivalent views of each value set — a JSON column and a pre-formatted Stata string — so consumers pick whichever fits their tool. All cross-file references are integer foreign keys. No inline text duplication. No pipe-delimited multi-value cells.
CSV conventions
- Delimiter: semicolon (`;`), which avoids quoting issues with JSON content and Scandinavian text
- Quoting: double-quote (`"`) with inner quotes doubled (`""`)
- Encoding: UTF-8
- No pipe-separated multi-value cells: all cells are scalar
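As a minimal sketch, these conventions map directly onto Python's standard `csv` module. The helper name `read_bundle_csv` is hypothetical (it is not part of autolabel), and the sample text is an abridged value-labels row in the style of the worked example.

```python
import csv
import io

def read_bundle_csv(text):
    """Parse semicolon-delimited, double-quoted CSV text into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=";",
        quotechar='"',
        doublequote=True,  # inner quotes are written as ""
    )
    return list(reader)

sample = (
    "value_label_id;value_labels_json\n"
    '1;"{""1"":""Man"",""2"":""Kvinna""}"\n'
)
rows = read_bundle_csv(sample)
print(rows[0]["value_labels_json"])  # {"1":"Man","2":"Kvinna"}
```

For a file on disk, opening it with `encoding="utf-8"` and `newline=""` before handing it to `csv.DictReader` matches the encoding rule above.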
Two hierarchical axes
Every variable's identity is defined by two axes:
- `scope`: what slice of reality the variable describes. Stored as separate columns `scope_level_1`, `scope_level_2`, etc. (e.g. `scope_level_1 = LISA`, `scope_level_2 = Individer_16_och_äldre`). Depth declared per domain in the manifest.
- `release`: which published release instance. Opaque string (e.g. `2005`, `Höstterminen_2020`, `2019_lone_formansutbetalningar`).
Scope depth is dynamic — declared per domain in the manifest. SCB uses 2 levels ("Register" / "Variant"); SSB uses 2 differently-named levels ("Source" / "Group"); a future agency could use 4. autolabel reads the manifest at runtime to drive UI labels and browse hierarchy — no code changes needed.
Core vs. augmentation files
The five files split into two tiers:
- Core (required): `variables.csv` + `value_labels.csv`. A consumer can label every variable from these two files alone, using a majority-label rule.
- Augmentation (recommended): `scope.csv` + `release_sets.csv` + `manifest.csv`. The precision layer: enables scope/release filtering and browse hierarchy.
A conformant consumer MUST work with just the core files (graceful degradation). A conformant producer SHOULD emit all five for full precision.
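The majority-label rule is not spelled out here, so the following is only one plausible reading, sketched under a stated assumption: when only the core files are available, a consumer picks, for each variable name, the (label, value set) pair that appears in the most `variables.csv` rows. The function name and the second label (`Sex`, ID `7`) are made up for illustration.

```python
from collections import Counter

def majority_label(variable_rows, name):
    """Pick the most frequent (variable_label, value_label_id) pair for `name`.

    Assumption: "majority" means the pair carried by the most rows in
    variables.csv; ties go to the pair seen first.
    """
    counts = Counter(
        (row["variable_label"], row["value_label_id"])
        for row in variable_rows
        if row["variable_name"] == name
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

rows = [
    {"variable_name": "kon", "variable_label": "Gender", "value_label_id": "1"},
    {"variable_name": "kon", "variable_label": "Gender", "value_label_id": "1"},
    {"variable_name": "kon", "variable_label": "Sex", "value_label_id": "7"},
]
print(majority_label(rows, "kon"))  # ('Gender', '1')
```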
1. Manifest — {domain}_manifest_{lang}.csv
Flat key-value CSV declaring domain metadata and scope hierarchy semantics. One file per language (titles are localized).
Columns
| Column | Type | Required | Description |
|---|---|---|---|
| `key` | string | yes | Configuration key |
| `value` | string | yes | Value for the key |
Required keys
- `domain`: catalog domain identifier (e.g. `scb`)
- `schema_version`: wire-format version (currently `2.0`)
- `publisher`: publishing organization name
- `bundle_release_date`: ISO 8601 date
- `languages`: pipe-separated list (e.g. `swe|eng`)
- `scope_depth`: integer N, the hierarchy depth for this domain
- `scope_level_1_name` … `scope_level_N_name`: short machine-readable level name
- `scope_level_1_title` … `scope_level_N_title`: human-readable level title in this file's language
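A consumer could turn these keys into a lookup table and derive the scope hierarchy from `scope_depth`. The sketch below assumes an abridged manifest held in a string; `load_manifest` is a hypothetical helper, not an autolabel function.

```python
import csv
import io

def load_manifest(text):
    """Read a key;value manifest into a dict and derive the scope level names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    manifest = {row["key"]: row["value"] for row in reader}
    depth = int(manifest["scope_depth"])
    levels = [
        (manifest[f"scope_level_{i}_name"], manifest[f"scope_level_{i}_title"])
        for i in range(1, depth + 1)
    ]
    return manifest, levels

text = """key;value
domain;scb
schema_version;2.0
languages;swe|eng
scope_depth;2
scope_level_1_name;register
scope_level_1_title;Register
scope_level_2_name;variant
scope_level_2_title;Variant
"""
manifest, levels = load_manifest(text)
print(levels)                            # [('register', 'Register'), ('variant', 'Variant')]
print(manifest["languages"].split("|"))  # ['swe', 'eng']
```

Because the loop runs to `scope_depth`, the same code handles a domain that declares three or four levels.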
Example — SCB English manifest
key;value
domain;scb
schema_version;2.0
publisher;Statistics Sweden (SCB)
bundle_release_date;2026-04-16
languages;swe|eng
scope_depth;2
scope_level_1_name;register
scope_level_1_title;Register
scope_level_2_name;variant
scope_level_2_title;Variant

Example — SSB English manifest (different naming)
key;value
domain;ssb
schema_version;2.0
publisher;Statistics Norway (SSB)
bundle_release_date;2026-04-20
languages;nob|eng
scope_depth;2
scope_level_1_name;source
scope_level_1_title;Source
scope_level_2_name;group
scope_level_2_title;Group

2. Scope — {domain}_scope_{lang}.csv
Atomic manifest. One row per (scope, release) combination.
Columns
| Column | Type | Required | Description |
|---|---|---|---|
| `scope_id` | integer | yes (PK) | Primary key, sequentially assigned during ingestion |
| `scope_level_1` | string | yes | Full name of scope level 1 (e.g. register name) |
| `scope_level_1_alias` | string | no | Optional short alias (e.g. `LISA`) |
| `scope_level_1_description` | string | no | Free-text description |
| `scope_level_2` | string | conditional | Second-level name. Required when `scope_depth >= 2` |
| `scope_level_2_alias` | string | no | Optional short alias for level 2 |
| `scope_level_2_description` | string | no | Free-text description of level 2 |
| … | … | … | Additional `scope_level_N`, `scope_level_N_alias`, and `scope_level_N_description` columns for each further level N up to `scope_depth` |
| `release` | string | yes | Atomic release identifier, an opaque string |
| `release_description` | string | no | Description specific to this release |
| `population_date` | string | no | ISO 8601 date |
| `measurement_info` | string | no | Free-text description of data collection |
Rules
- Primary key: `scope_id`
- Uniqueness: `(scope_level_1, ..., scope_level_N, release)` unique within a domain
- Every row MUST have `scope_level_1` populated. Deeper levels MAY be empty per-row
- All columns are scalar strings or integers. No JSON columns. No slash-delimited compound values
Release axis rules
- `release` is an opaque string: year (`2005`), academic term (`Höstterminen_2020-Vårterminen_2021`), calendar date (`2014-10-15`), quarter (`2005_Q1`), content-tagged edition, etc.
- Each atomic release is one row. No pipe-delimited multi-release strings. Non-contiguous sequences (e.g. election years) get one row per year; gap years are absent.
- No range strings like `1990-2009`. Use 20 rows, one per atomic year.
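The range rule can be illustrated with a small sketch: a producer expands an inclusive year range into one atomic release string per year before writing scope rows. The helper name `atomic_releases` is hypothetical.

```python
def atomic_releases(first_year, last_year):
    """Expand an inclusive year range into one opaque release string per year."""
    return [str(year) for year in range(first_year, last_year + 1)]

releases = atomic_releases(1990, 2009)
print(len(releases))  # 20
print(releases[:3])   # ['1990', '1991', '1992']
```

Non-contiguous sequences need no helper at all: the producer lists only the years that exist, and the gap years simply never appear as rows.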
3. Variables — {domain}_variables_{lang}.csv
Central facts table. One row per unique variable-metadata era.
Columns
| Column | Type | Required | Description |
|---|---|---|---|
| `variable_name` | string | yes | Column name in the raw data |
| `variable_label` | string | yes | Human-readable label |
| `variable_definition` | string | no | Full definition text |
| `variable_unit` | string | no | Unit of measurement |
| `variable_type` | enum | yes | `categorical` / `continuous` / `text` / `date` / `identifier` |
| `variable_description` | string | no | Long-form description |
| `variable_source` | string | no | Source authority |
| `variable_external_comment` | string | no | External-facing comment |
| `datatype` | string | no | Storage type (`char(1)`, `int`, `float`, …) |
| `value_label_id` | integer | no | FK → `value_labels.csv` (null for non-categorical) |
| `release_set_id` | integer | yes | FK → `release_sets.csv` (via junction) |
Rules
- Primary key: `(variable_name, datatype, value_label_id, release_set_id)`
- No scope columns. Scope is resolved via the FK chain: `release_set_id → release_sets → scope_id → scope.scope_level_N`
- Metadata drift handling: when any variable-level attribute changes across releases, the variable splits into multiple rows, each pointing at its own `release_set_id` covering exactly the releases where that metadata applies
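The FK chain in the rules above can be followed with two dictionary lookups. This is a sketch over in-memory rows rather than real files; the function name is hypothetical and the rows are abridged to the columns the chain needs.

```python
def scopes_for_variable(variable_row, release_sets, scope_rows):
    """Follow release_set_id -> release_sets -> scope_id -> scope rows."""
    scope_by_id = {row["scope_id"]: row for row in scope_rows}
    return [
        scope_by_id[link["scope_id"]]
        for link in release_sets
        if link["release_set_id"] == variable_row["release_set_id"]
    ]

scope_rows = [
    {"scope_id": 1001, "scope_level_1_alias": "LISA", "release": "1990"},
    {"scope_id": 1002, "scope_level_1_alias": "LISA", "release": "1991"},
]
release_sets = [
    {"release_set_id": 42, "scope_id": 1001},
    {"release_set_id": 42, "scope_id": 1002},
]
variable = {"variable_name": "kon", "release_set_id": 42}

resolved = scopes_for_variable(variable, release_sets, scope_rows)
print([(s["scope_level_1_alias"], s["release"]) for s in resolved])
# [('LISA', '1990'), ('LISA', '1991')]
```

Under metadata drift, a second `kon` row would carry a different `release_set_id` and resolve to a disjoint list of scope atoms.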
4. Value labels — {domain}_value_labels_{lang}.csv
Content-hashed lookup of unique value sets.
Columns
| Column | Type | Required | Description |
|---|---|---|---|
| `value_label_id` | integer | yes (PK) | Content-hash-derived primary key |
| `variable_name` | string | no | Representative variable (informational only) |
| `value_labels_json` | string | yes | JSON dict `{"1":"Man","2":"Kvinna"}` |
| `value_labels_stata` | string | yes | Pre-formatted: `"1" "Man" "2" "Kvinna"`, direct input to Stata's `label define` |
| `code_count` | integer | yes | Number of distinct codes |
Two equivalent views
Tools with native JSON parsing (Python pandas, R, DuckDB, JavaScript) MAY parse value_labels_json. Tools without convenient JSON parsing (Stata, shell pipelines, spreadsheet apps) MAY use value_labels_stata. Both encode the same value set.
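To make the equivalence concrete, the sketch below derives the Stata-style string from the JSON column and compares the two. It assumes labels contain no embedded double quotes, since the escaping rule for that case is not stated here; `stata_from_json` is a hypothetical name.

```python
import json

def stata_from_json(value_labels_json):
    """Render a JSON value set as space-separated "code" "label" pairs."""
    value_set = json.loads(value_labels_json)
    return " ".join(f'"{code}" "{label}"' for code, label in value_set.items())

as_json = '{"1":"Man","2":"Kvinna"}'
as_stata = '"1" "Man" "2" "Kvinna"'
print(stata_from_json(as_json) == as_stata)  # True
```

A producer could run the same comparison over every row as a consistency check before publishing a bundle.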
Content-hashing rule
value_label_id is a stable integer hash over the canonical-language (English) normalized value set. Swedish and English files share the same IDs for conceptually identical sets.
Match by FK, not by name
Look up the variable in variables.csv to get value_label_id, then look up that ID in value_labels.csv. Do NOT match on variable_name (it's informational only).
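A short sketch of that lookup, using in-memory rows. The variable `kon_partner` is a made-up example of a second variable that shares the value set whose representative name is `kon`; matching on name would miss it, while the FK finds it.

```python
import json

def labels_for(variable_row, value_label_rows):
    """Resolve a variable's value set through value_label_id, never by name."""
    label_id = variable_row["value_label_id"]
    if label_id in ("", None):  # non-categorical variables carry no FK
        return None
    by_id = {row["value_label_id"]: row for row in value_label_rows}
    return json.loads(by_id[label_id]["value_labels_json"])

value_label_rows = [
    {"value_label_id": "1", "variable_name": "kon",
     "value_labels_json": '{"1":"Man","2":"Kvinna"}'},
]
# Hypothetical variable whose name differs from the representative one.
variable_row = {"variable_name": "kon_partner", "value_label_id": "1"}
print(labels_for(variable_row, value_label_rows))  # {'1': 'Man', '2': 'Kvinna'}
```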
5. Release sets — {domain}_release_sets_{lang}.csv
Junction table. One row per (release_set_id, scope_id) pair.
Columns
| Column | Type | Required | Description |
|---|---|---|---|
| `release_set_id` | integer | yes | Content-hash-derived ID for the release set |
| `scope_id` | integer | yes | FK → `scope.scope_id` |
Rules
- Primary key: composite `(release_set_id, scope_id)`
- Content-hashing: `release_set_id` is a stable integer hash over the sorted ascending list of `scope_id`s in the set. All `variables.csv` rows that apply to exactly the same set of scope atoms share the same `release_set_id`
- Scope invariant: within a single `release_set_id`, all referenced `scope_id`s MUST resolve to the same scope (identical `scope_level_*` values)
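The content-hashing rule names the input (the sorted ascending `scope_id` list) but not the hash function. The sketch below is therefore an illustration only: it assumes SHA-256 over a comma-joined list, truncated to 32 bits, which is not necessarily what the catalog pipeline uses.

```python
import hashlib

def release_set_id(scope_ids):
    """Derive a stable integer from a set of scope_ids (illustrative hash only)."""
    canonical = ",".join(str(i) for i in sorted(set(scope_ids)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return int(digest[:8], 16)  # assumption: keep the first 32 bits

a = release_set_id([1003, 1001, 1002])
b = release_set_id([1001, 1002, 1003])
print(a == b)  # True: input order does not matter
```

Sorting before hashing is what lets two variables with the same set of scope atoms land on the same ID regardless of the order in which the producer encountered them.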
Referential integrity
A conformant bundle MUST satisfy:
- Every non-null `value_label_id` in `variables.csv` exists in `value_labels.csv`
- Every `release_set_id` in `variables.csv` has at least one row in `release_sets.csv`
- Every `scope_id` in `release_sets.csv` exists in `scope.csv`
- Within each `release_set_id`, all referenced `scope_id`s resolve to the same scope (identical `scope_level_*` values)
- Every `scope.csv` row has `scope_level_1` populated; deeper levels may be empty per-row
- The manifest file exists for every language listed in its `languages` key
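The first five checks can be sketched as a single pass over in-memory rows. This is an illustrative validator with a hypothetical name, not the catalog's own tooling; it skips the manifest-per-language check because that one needs the file system. The sample rows are shortened and deliberately violate the scope invariant.

```python
def integrity_errors(variables, value_labels, release_sets, scope, depth):
    """Return human-readable violations of the FK rules (empty list if none)."""
    errors = []
    label_ids = {row["value_label_id"] for row in value_labels}
    set_ids = {row["release_set_id"] for row in release_sets}
    scope_by_id = {row["scope_id"]: row for row in scope}

    for var in variables:
        label_id = var["value_label_id"]
        if label_id not in ("", None) and label_id not in label_ids:
            errors.append(f"unknown value_label_id {label_id}")
        if var["release_set_id"] not in set_ids:
            errors.append(f"release_set_id {var['release_set_id']} has no junction rows")

    first_path = {}  # release_set_id -> scope path of the first atom seen
    for link in release_sets:
        atom = scope_by_id.get(link["scope_id"])
        if atom is None:
            errors.append(f"unknown scope_id {link['scope_id']}")
            continue
        path = tuple(atom.get(f"scope_level_{i}", "") for i in range(1, depth + 1))
        if first_path.setdefault(link["release_set_id"], path) != path:
            errors.append(f"release_set_id {link['release_set_id']} mixes scopes")

    for atom in scope:
        if not atom.get("scope_level_1"):
            errors.append(f"scope_id {atom['scope_id']} lacks scope_level_1")
    return errors

scope = [
    {"scope_id": 1, "scope_level_1": "LISA", "scope_level_2": "Individer", "release": "2005"},
    {"scope_id": 2, "scope_level_1": "STATIV", "scope_level_2": "Individer", "release": "2005"},
]
release_sets = [{"release_set_id": 42, "scope_id": 1}, {"release_set_id": 42, "scope_id": 2}]
value_labels = [{"value_label_id": 1}]
variables = [{"variable_name": "kon", "value_label_id": 1, "release_set_id": 42}]

print(integrity_errors(variables, value_labels, release_sets, scope, depth=2))
# ['release_set_id 42 mixes scopes']
```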
schema_version — wire-format gate
Every autolabel-readable dataset carries a schema_version marker:
- Stata: `char _dta[schema_version] "2.0"`
- Python / pandas: `df.attrs['schema_version'] = '2.0'`
- R / haven: `attr(df, 'schema_version') <- '2.0'`
This is a runtime compatibility check — autolabel reads it to decide whether an installed tool version can parse the bundle. A future breaking change (e.g. required column added) would bump schema_version to 3.0, and the tool-version gate would reject the new bundle until the tool upgrades.
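How the gate compares versions is not specified here. One plausible reading, sketched below as an assumption, is that a tool accepts a bundle only when the major component of `schema_version` matches the major version it was built for. `SUPPORTED_MAJOR` and `can_parse` are hypothetical names.

```python
SUPPORTED_MAJOR = 2  # assumption: this tool build understands schema 2.x

def can_parse(schema_version, supported_major=SUPPORTED_MAJOR):
    """Accept a bundle only when its major schema version matches the tool's."""
    major = int(schema_version.split(".")[0])
    return major == supported_major

print(can_parse("2.0"))  # True
print(can_parse("3.0"))  # False
```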
Multilingual convention
Bundles ship one set of 5 files per language:
{domain}_manifest_{lang}.csv
{domain}_scope_{lang}.csv
{domain}_variables_{lang}.csv
{domain}_value_labels_{lang}.csv
{domain}_release_sets_{lang}.csv

English is REQUIRED at the field level. Native-language files are REQUIRED where the source is non-English. Additional languages are OPTIONAL.
IDs are canonical across language files. scope_id, value_label_id, and release_set_id MUST be identical across all language variants. Only localized text columns differ.
Manifest titles are localized per file. Level names (lowercase, machine-readable) are canonical and MUST match across language files; titles (human-readable) differ per language.
Extension namespacing
Producers MAY add extension columns to any data file or extension keys to the manifest. Extension names MUST use a colon-prefix matching the catalog domain:
- `scb:ext_ansvarig_enhet` (column)
- `scb:ext_internal_notes` (manifest key)
- `dst:ext_hq_flag`
- `healthdcat:ext_access_rights`
Consumer tools MUST ignore unrecognized namespaced columns and keys. This enables non-breaking extension without bumping schema_version.
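A consumer can honor this rule by partitioning a file's header before it reads any data. The sketch assumes that core column names never contain a colon, which holds for every column listed in this schema; `split_extensions` is a hypothetical helper.

```python
def split_extensions(fieldnames):
    """Separate core column names from namespaced extension columns."""
    core = [name for name in fieldnames if ":" not in name]
    extensions = [name for name in fieldnames if ":" in name]
    return core, extensions

header = ["variable_name", "variable_label", "scb:ext_ansvarig_enhet", "release_set_id"]
core, extensions = split_extensions(header)
print(core)        # ['variable_name', 'variable_label', 'release_set_id']
print(extensions)  # ['scb:ext_ansvarig_enhet']
```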
Design rationale
Why CSV-only on-disk with companion columns for structured content
CSV is the lowest common denominator of tabular-data interchange. Every statistical package, every programming language, every data pipeline reads rectangular CSV natively. JSON is convenient in some environments (Python, JavaScript) and inconvenient in others (Stata, shell scripts). The schema commits to CSV throughout to stay readable in any rectangular-data tool.
Where a value naturally has structured form (the value set {code: label} dict), value_labels.csv provides both a JSON column and a companion pre-formatted string column. Redundant on purpose — consumers pick whichever fits.
Why scope (not context, not register) — and why separate columns per level
"Register" is SCB-inflected. "Context" is overloaded in the LLM era. "Scope" is unambiguous, matches programming semantics (variable scope = the range where a declaration applies), and reads naturally in filter syntax.
Scope levels are stored as separate columns rather than a single slash-separated string because:
- No delimiter collision. Agency names can contain slashes (e.g. `Hälso-/sjukvårdsregistret`). Separate columns eliminate sanitization.
- Each level is independently queryable. `keep if scope_level_1 == "LISA"` works without substring parsing.
- Native argument passing. `autolabel variables, scope("LISA" "Individer 16 år och äldre")`: users pass levels as separate quoted strings.
- Each level can have its own alias. `scope_level_1_alias` enables short-code lookups without conflating alias semantics with the hierarchical path.
Why release (not version, not period)
"Version" implies semver. "Period" implies continuous time intervals — but SCB's six observed release patterns include calendar dates, content-tagged editions, and future pre-announced events. "Release" is the universal data-publishing term with no temporal presumption.
Why scope and release are separate axes
Scope is identity. kon in LISA and kon in STATIV are different conceptual variables — different labels, different value sets, different meaning.
Release is temporal. kon in LISA 2005 and kon in LISA 2006 are the same conceptual variable across editions.
This asymmetry is what enables release_sets.csv to compress variable rows. For SCB, the junction-table design compresses what would otherwise be ~521,000 (variable × atom) flat rows into ~80,230 variables.csv rows + ~81,693 junction rows.
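The arithmetic behind that claim, using the approximate SCB figures quoted above:

```python
flat_rows = 521_000      # approximate (variable x atom) rows in a flat layout
variables_rows = 80_230  # approximate variables.csv rows
junction_rows = 81_693   # approximate release_sets.csv rows

normalized_rows = variables_rows + junction_rows
ratio = flat_rows / normalized_rows
print(normalized_rows)  # 161923
print(round(ratio, 1))  # 3.2
```

So the normalized layout stores roughly a third as many rows as the flat one, and each junction row holds only two integers.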
Why integer foreign keys throughout
No text duplication. Referential integrity enforceable. Integer-join performance.
Why dynamic scope depth
Fixed 2-level scope would lock the format to a Nordic register-data taxonomy. Future agencies with different hierarchies (3-level, 5-level, or differently-named 2-level) would require a breaking schema bump. Dynamic depth declared in the manifest is agency-neutral and future-proof.
Worked example — kon in LISA
kon (gender) is stable across LISA 1990–2009 with value set {"1":"Man","2":"Kvinna"}.
scb_scope_eng.csv (20 atomic rows, one per year)
scope_id;scope_level_1;scope_level_1_alias;scope_level_2;release;…
1001;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"1990";…
1002;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"1991";…
…
1020;"Longitudinell integrationsdatabas…";"LISA";"Individer 16 år och äldre";"2009";…

scb_variables_eng.csv (1 row, fully compressed via both FKs)
variable_name;variable_label;…;datatype;value_label_id;release_set_id
kon;Gender;…;char(1);1;42

scb_value_labels_eng.csv (1 row with both JSON and Stata columns)
value_label_id;variable_name;value_labels_json;value_labels_stata;code_count
1;kon;"{""1"":""Man"",""2"":""Kvinna""}";"""1"" ""Man"" ""2"" ""Kvinna""";2

scb_release_sets_eng.csv (20 junction rows for release_set_id=42)
release_set_id;scope_id
42;1001
42;1002
42;1003
…
42;1020

Stata query for "kon in LISA 2005"
* Step 1: find the scope atom for LISA, release 2005
import delimited using "scb_scope_eng.csv", clear delimiters(";")
keep if scope_level_1_alias == "LISA" & release == "2005"
levelsof scope_id, local(rv) // gets 1016 (1990 is 1001, so 2005 is 1016)
* Step 2: find the release set that contains that atom
import delimited using "scb_release_sets_eng.csv", clear delimiters(";")
keep if scope_id == `rv'
levelsof release_set_id, local(rs) // gets 42
* Step 3: pick the variable row attached to that release set
import delimited using "scb_variables_eng.csv", clear delimiters(";")
keep if variable_name == "kon" & release_set_id == `rs'
* apply label from value_labels_stata column

No JSON parsing anywhere.
Migration from the pre-finalized format
Older bundles (3 files: variables + value_labels + registers with inline register/variant/versions columns and pipe-delimited versions) are obsolete. Schema v2 replaces them with the 5-file normalized layout described here.
| Pre-finalized layout | Schema v2 |
|---|---|
| 3 files per language | 5 files per language (adds manifest + release_sets) |
| `register_id` FK on variables | `release_set_id` FK (junction table) |
| Pipe-delimited `versions` column | Atomic rows in scope; junction table in release_sets |
| `register` + `variant` columns | `scope_level_N` separate columns (dynamic depth) |
| `version` column | `release` opaque string |
| No manifest | Manifest CSV per language per domain |
The RegiStream catalog pipeline (DuckDB) runs a one-pass migration into the v2 layout.
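The central step of such a migration, turning a pipe-delimited `versions` cell into atomic scope rows, can be sketched as follows. This is an illustration in plain Python, not the RegiStream pipeline itself (which runs in DuckDB); the function name and the starting `scope_id` are hypothetical, and the old-layout column names are taken from the table above.

```python
def explode_versions(old_row, next_scope_id):
    """Turn one pre-v2 row with pipe-delimited versions into atomic scope rows."""
    rows = []
    for offset, version in enumerate(old_row["versions"].split("|")):
        rows.append({
            "scope_id": next_scope_id + offset,
            "scope_level_1": old_row["register"],
            "scope_level_2": old_row["variant"],
            "release": version,
        })
    return rows

old = {
    "register": "LISA",
    "variant": "Individer 16 år och äldre",
    "versions": "2005|2006|2007",
}
for row in explode_versions(old, next_scope_id=1001):
    print(row["scope_id"], row["release"])
# 1001 2005
# 1002 2006
# 1003 2007
```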
Institutional metadata
The schema is agency-neutral. Any institution can run autolabel against its own register or survey data by placing a schema v2 bundle at ~/.registream/autolabel/<yourdomain>/ and calling autolabel variables, domain(yourdomain) lang(eng). Minimum viable is the two core files — variables.csv and value_labels.csv. The three augmentation files (scope.csv, release_sets.csv, manifest.csv) enable per-scope and per-release precision when you need it.
No internet access required. No registration, no coordination with registream.org — the domain name is local to the institution. Private domains sit side-by-side with the public ones on disk; the tool does not distinguish.
For the full walkthrough — directory layout, distribution patterns (shared network path, git repo, sysadmin deployment), hybrid public+private setups — see Private domains & institutional setup.
Further reading
- autolabel Stata reference — how the tool reads this schema
- Catalog — bundles available for download
- Private domains & institutional setup — build your own bundle
- Markdown source of this page, in the `autolabel` repo