
Python reference

Complete reference for autolabel in Python — DataFrame methods (df.autolabel, df.lookup, df.lab…), arguments, labeling rules, the scope() / suggest() surface, df.attrs['registream'] storage, Jupyter / matplotlib / seaborn integration, and institutional metadata.

Install

Requires Python 3.11 or later. pandas is installed automatically as a dependency.

From PyPI

pip install registream
pip install registream-autolabel

registream-autolabel depends on registream-core (the shared config, metadata cache, and schema validator). Installing the meta-package registream pulls in both in one step, mirroring the Stata net install registream convention.

Optional plotting integration. seaborn is not a hard dependency. Install it separately (pip install seaborn) to light up the label-aware plot wrappers described under Jupyter, matplotlib, seaborn. Without seaborn, every other feature still works.

Full install notes — secure environments, institutional domains, cache directory override, and the first-run wizard — are on the install guide.

Quick start

import pandas as pd
import registream.autolabel  # side effect: installs autolabel methods on pd.DataFrame

df = pd.read_stata("lisa_2020.dta")

# Apply variable and value labels from SCB metadata (English); scope auto-inferred.
df.autolabel(domain="scb", lang="eng")

# Display-time labeled view without mutating df.
df.lab.head()

Labels land on df.attrs['registream'] as a structured dict — the column data is never mutated. Importing registream.autolabel adds autolabel methods (df.autolabel, df.lookup, df.lab, and a handful of editors) directly to pd.DataFrame, so the API reads like a native pandas method. That matches the Stata surface (autolabel variables, domain(scb) lang(eng)) and the R surface (df |> autolabel(domain = "scb", lang = "eng")) verb-for-verb; the subject is always your data.

Public API

DataFrame methods (the canonical surface)

df.autolabel(
    domain: str = "scb",
    lang: str = "eng",
    *,
    scope: list[str] | tuple[str, ...] | None = None,
    release: str | None = None,
    label_type: Literal["both", "variables", "values"] = "both",
    variables: list[str] | None = None,
    exclude: list[str] | None = None,
    include_unit: bool = True,
    dryrun: bool = False,
    directory: Path | str | None = None,
) -> pd.DataFrame | DryRunResult

Applies labels from the catalog to df.attrs. Returns df itself (attrs updated in place; column data untouched), or a DryRunResult when dryrun=True.

Inspection

df.lookup(
    variables: str | list[str],
    *,
    domain: str = "scb",
    lang: str = "eng",
    scope: list[str] | tuple[str, ...] | None = None,
    release: str | None = None,
    detail: bool = False,
    directory: Path | str | None = None,
) -> LookupResult

df.lab                 # LabeledView — value codes → labels at display time
df.lab.head(n=5)
df.lab.as_dataframe()

df.variable_labels()                    # dict
df.value_labels()                       # dict
df.get_variable_labels(columns=None)    # None → dict; str → str|None; list → filtered dict
df.get_value_labels(columns=None)

Label editors (mutate df.attrs)

df.set_variable_labels(labels, label=None)                 # str, list, or dict form
df.set_value_labels(columns, value_labels=None, *, overwrite=False)
df.copy_labels(source, target)
df.meta_search(pattern, columns=None, invert=False)

Module-level functions (no df, non-mutating, or result objects)

from registream.autolabel import suggest, scope, info, cite, update_datasets

suggest(df, *, domain="scb", lang="eng", scope=None, release=None, directory=None)
    -> SuggestResult                    # preview coverage without mutating df

scope(domain="scb", lang="eng", *, search=None, scope=None, release=None,
      directory=None) -> pd.DataFrame   # catalog browser, no df required

update_datasets(domain, lang, *, version="latest", force=False, directory=None)
    -> DownloadResult                   # refresh the on-disk metadata bundle

info()                                  # dict: config + cache + versions
cite()                                  # full citation block (matches `autolabel cite` in Stata)

Core package (registream)

from registream.citation import cite, cite_bibtex   # full cite block / versioned BibTeX
from registream.info import info                    # config + environment snapshot
from registream.metadata import load_bundle         # low-level 5-file bundle loader
from registream.schema import Manifest, SchemaError, SchemaVersionError

Description

autolabel applies variable and value labels from the RegiStream metadata catalog to pandas DataFrames. On first use, it downloads and caches a 5-file metadata bundle (manifest + variables + value_labels + scope + release_sets) from registream.org, matches your columns against the catalog, and attaches labels to df.attrs['registream'].

RegiStream hosts metadata for government statistical agencies including Statistics Sweden (scb), Statistics Denmark (dst), Statistics Norway (ssb), Försäkringskassan (fk), Socialstyrelsen (sos), and Statistics Iceland (hagstofa). Institutions can create metadata files for their own data — see Institutional metadata.

Autolabel schema v2 is depth-agnostic: each domain's manifest declares the scope depth and per-level names. For SCB the two scope levels are Register (e.g. LISA) and Variant (e.g. Individer 16 år och äldre). For SSB the levels are Source and Group. Every variable can appear in multiple scopes with different labels, value definitions, and releases; when no scope is specified, autolabel automatically infers the best-matching scope by analyzing the columns in your DataFrame.

First-run setup: the first call into any autolabel / registream function triggers a short setup wizard asking you to choose a mode (Offline, Standard, or Full). This governs whether metadata downloads automatically and whether usage data is collected. Inspect the current mode with registream.autolabel.info(); change it later by editing ~/.registream/config.toml. Non-interactive sessions silently pick Offline mode — so pytest never hits the network.

Arguments — Required

df

A pandas.DataFrame whose columns will receive labels. Non-matching columns are left unchanged (pre-existing labels in df.attrs are preserved).

domain

The metadata domain. Default "scb". Other shipped domains: "dst", "ssb", "hagstofa", "fk", "sos". Institutions can register custom domains; see Institutional metadata.

lang

The language for labels. Default "eng". Availability varies per domain — for "scb": "eng" or "swe".

Arguments — Filtering

scope

A list (or tuple) of scope-level tokens. One token per scope level of the domain's manifest. For SCB (depth 2):

  • scope=["LISA"] — matches all sub-scopes under LISA
  • scope=["LISA", "Individer 16 år och äldre"] — matches that specific scope atom

Matching per level uses a 3-step ladder, case- and apostrophe-insensitive: (1) exact alias match (scope_level_N_alias), (2) exact name match (scope_level_N), (3) name substring match.
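
As a rough pure-Python sketch of that ladder (the "alias"/"name" row fields and the sample rows are invented for illustration; this is not the library's code):

```python
# Rough sketch of the 3-step ladder, assuming catalog rows carry per-level
# "alias" and "name" fields. Sample rows are made up for illustration.
def normalize(s: str) -> str:
    # case- and apostrophe-insensitive comparison
    return s.casefold().replace("'", "").replace("\u2019", "")

def match_level(token: str, rows: list[dict]) -> list[dict]:
    t = normalize(token)
    for predicate in (
        lambda r: normalize(r["alias"]) == t,   # (1) exact alias match
        lambda r: normalize(r["name"]) == t,    # (2) exact name match
        lambda r: t in normalize(r["name"]),    # (3) name substring match
    ):
        hits = [r for r in rows if predicate(r)]
        if hits:
            return hits
    return []

rows = [
    {"alias": "LISA", "name": "LISA-databasen"},
    {"alias": "FR", "name": "Företagsregister"},
]
match_level("lisa", rows)       # hits step 1 (exact alias, case-insensitive)
match_level("företags", rows)   # falls through to step 3 (substring)
```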

Overflow shorthand: pass scope_depth + 1 tokens and the last one is promoted to release. For SCB: scope=["LISA", "Individer 16 år och äldre", "2021"] is equivalent to scope=["LISA", "Individer 16 år och äldre"], release="2021".
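
The token split behind the shorthand is mechanical, as this sketch shows (function name is an assumption, not the library's internal name):

```python
# Sketch of the overflow rule: given the manifest's scope_depth, passing
# depth + 1 tokens means the trailing token is really a release.
def split_scope_tokens(tokens: list[str], scope_depth: int):
    if len(tokens) == scope_depth + 1:
        return tokens[:-1], tokens[-1]   # promote the last token to release
    return tokens, None

split_scope_tokens(["LISA", "Individer 16 år och äldre", "2021"], scope_depth=2)
# → (["LISA", "Individer 16 år och äldre"], "2021")
```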

When omitted, autolabel infers the best-matching scope by counting dataset-column overlap per scope and picks the scope with the most matches; ≥ 10% overlap triggers a primary-scope preference in the collapse, otherwise the dataset is treated as mixed-panel.

release

A single release identifier (typically a year, e.g. "2021"). Filters scope rows to that release before labeling — useful when value-label sets change over time (municipality codes, education classifications).

Arguments — Output

label_type

One of "both" (default), "variables", or "values". Controls whether variable labels, value labels, or both are applied.

variables

Optional list restricting which columns receive labels. Names not in df are silently skipped.

exclude

Optional list of column names to skip. Applied after variables; mirrors the Stata exclude() option.

include_unit

If True (default), append " (unit)" to variable labels when variable_unit is non-empty in the metadata. Matches the Stata behavior.

dryrun

If True, return a DryRunResult describing what would be stamped without mutating df.attrs. Use this in notebook workflows to preview the labeling plan before committing.

directory

Override the RegiStream cache directory. When omitted, the path resolves from the REGISTREAM_DIR environment variable, then the cache_dir field in config.toml, then the platform-appropriate user data directory. Set REGISTREAM_DIR to share a cache with the Stata and R clients on the same machine.
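
The resolution order amounts to a simple fallback chain; a minimal sketch, assuming a parsed config dict (the helper name is made up for this illustration):

```python
import os
from pathlib import Path

# Sketch of the resolution order described above: REGISTREAM_DIR, then the
# cache_dir field from config.toml, then a platform default. The function
# name and the config dict are assumptions of this sketch.
def resolve_cache_dir(config: dict, platform_default: Path) -> Path:
    env = os.environ.get("REGISTREAM_DIR")
    if env:
        return Path(env)                      # 1. environment variable wins
    if config.get("cache_dir"):
        return Path(config["cache_dir"])      # 2. config.toml field
    return platform_default                   # 3. platform user data dir

resolve_cache_dir({"cache_dir": "/shared/registream"}, Path.home() / ".registream")
```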

Labeling rules

When you run df.autolabel(...), Python needs to decide which metadata row to use for each column. The rule is identical to the Stata and R clients on both the automatic and the explicit-pin paths.

Automatic mode (no pin)

Automatic mode is what you get when you call df.autolabel(domain="scb", lang="eng") with no scope or release. Two steps:

  1. Primary scope inference. For each scope tuple in the bundle, infer_scope() counts distinct variable-name matches against df.columns (case-insensitive), ranks by match count → coverage → scope tuple ascending, and picks the winner. When the top scope covers fewer than 10% of columns, has_strong_primary is False — the dataset is treated as mixed-panel with no dominant source.
  2. Per-variable collapse with majority fallback. collapse_to_one_per_variable() joins variables through release_sets to scope, computes _label_freq (count of rows sharing the same (variable_name, variable_label)), and sorts per variable on:
    1. _is_primary descending (primary-scope rows first)
    2. _label_freq descending (most-common label next)
    3. scope_level_1 .. N ascending (deterministic tie-break)
    Dedup on variable_name keeps the first row. Every column with any metadata entry gets a label; the primary-scope preference is a sort-key bias, not a filter. Columns not in the primary scope fall through to the majority label across their candidate scopes.
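
A toy version of that collapse ordering (labels and scope names here are invented; this is an illustration of the sort keys, not the library's code):

```python
from collections import Counter

# Toy version of the collapse: primary-scope rows first, then the most common
# (variable, label) pair, then scope name as a deterministic tie-break; the
# first row per variable survives.
def collapse(rows: list[dict]) -> dict[str, dict]:
    label_freq = Counter((r["variable_name"], r["variable_label"]) for r in rows)
    ordered = sorted(rows, key=lambda r: (
        not r["is_primary"],                                     # 1. primary first
        -label_freq[(r["variable_name"], r["variable_label"])],  # 2. majority label
        r["scope_level_1"],                                      # 3. deterministic tie-break
    ))
    winners: dict[str, dict] = {}
    for r in ordered:            # dedup on variable_name keeps the first row
        winners.setdefault(r["variable_name"], r)
    return winners

rows = [
    {"variable_name": "kon", "variable_label": "Gender of child",
     "is_primary": False, "scope_level_1": "Barnregistret"},
    {"variable_name": "kon", "variable_label": "Gender",
     "is_primary": True, "scope_level_1": "LISA"},
]
collapse(rows)["kon"]["variable_label"]   # "Gender" — the primary-scope row wins
```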

Explicit-pin mode

When you pass scope= (and optionally release=), autolabel skips inference and narrows the bundle to the pinned subset before collapse. Rows outside the pin are never considered.

df.autolabel(
    scope=["LISA", "Individer 16 år och äldre"],
    release="2021",
)

Label-wipe guard: columns in df that have no row in the pinned scope are skipped — their pre-existing df.attrs entries are preserved. Chain multiple explicit-pin calls for multi-scope panels without clobbering earlier labels.

Multi-scope panels

Two options when your panel mixes variables from multiple scopes:

Option A — automatic (simple): one call, no scope arg. Primary scope is inferred; non-primary columns fall through to the majority-label rule. One line, everything gets labeled.

Option B — explicit-pin per subset (reproducible): run suggest() first to preview coverage per scope, then pin each subset:

df.autolabel(
    variables=["lopnr", "kon", "alder", "kommun"],
    scope=["LISA", "Individer 16 år och äldre"],
)
df.autolabel(
    variables=["cfarnr", "bransch", "anstallda"],
    scope=["Företagsregister"],
)

The label-wipe guard makes this safe: the second call only touches the columns it has metadata for; the first call's labels on LISA-only variables survive.

scope() browser

scope() is the depth-agnostic catalog browser — the Python analogue of Stata's autolabel scope subcommand. Returns a plain DataFrame you can chain with any pandas op.

Four modes

from registream.autolabel import scope

scope()                                                       # level-1 browse with variable_count
scope(scope=["LISA"])                                         # drill to level 2 under LISA
scope(scope=["LISA", "Individer 16 år och äldre"])            # releases at this scope atom
scope(scope=["LISA", "Individer 16 år och äldre", "2021"])    # variables at this (scope, release) atom
scope(search="lisa")                                          # filtered level-1 browse

The overflow-token shorthand (mode 4) is the terse way to list variables at an atom without writing release= explicitly. The scope= argument is keyword-only; pass a list of tokens in order (level 1 to level N).

Filtered drill

scope(scope=["LISA"], search="individer")

Narrows the level-2 values under LISA to those whose name contains "individer" (case- and apostrophe-insensitive).

suggest()

suggest(df, ...) analyzes the columns of df against the domain metadata and reports which scopes would contribute labels under automatic mode — without mutating df. Recommended first step for mixed-panel workflows.

from registream.autolabel import suggest

result = suggest(df, domain="scb", lang="eng")
result                         # Jupyter: header + coverage table
result.coverage                # DataFrame: scope_level_1..N, matches, coverage_pct, is_primary
result.primary                 # ScopeInference | None
result.pin_command             # "df.autolabel(domain='scb', lang='eng', scope=[...])"

The coverage table shows per-scope hit counts sorted descending. The inferred primary (when coverage ≥ 10%) is marked is_primary=True. The pin_command is a copy-paste-ready df.autolabel(...) call that reproduces the inferred pin.

Scope-specific preview

from registream.autolabel import suggest

suggest(df, scope=["LISA"])
suggest(df, scope=["LISA", "Individer 16 år och äldre"])

Pinning narrows the coverage calc to the requested atom so you see exactly what would label under that specific scope.

Under the hood. suggest() runs the same filter_bundle() and inference that autolabel() uses — so what you see is what you'll get.

df.attrs['registream'] storage

Python doesn't have R's haven_labelled column-attribute convention; instead pandas has df.attrs, a dict-valued slot on every DataFrame for arbitrary metadata. autolabel uses it directly:

df.attrs["registream"] = {
    "variable_labels": {"kon": "Sex", "alder": "Age (years)", ...},
    "value_labels":    {"kon": {1: "Man", 2: "Woman"}, ...},
    "domain":          "scb",
    "lang":            "eng",
    "scope":           ("LISA", "Individer 16 år och äldre"),
    "release":         "2021",
    "scope_depth":     2,
}
df.attrs["schema_version"] = "2.0"

The underlying column data is never modified:

df.autolabel(domain="scb", lang="eng")

df.attrs["registream"]["variable_labels"]["kon"]  # "Sex"
df.attrs["registream"]["value_labels"]["kon"]     # {1: "Man", 2: "Woman"}
df["kon"].dtype                                   # int64 — still raw integer codes

What survives pandas ops

  • Preserved: df.copy(); most chain operations that keep the same object identity (e.g. df.assign(...), df.merge(...) on one side, indexing that returns the same DataFrame).
  • Propagated automatically: df["new"] = df["old"] and df.rename(columns=...) — see pandas-method patches. Labels follow the column.
  • Lost (file I/O): df.to_parquet, df.to_csv, plain df.to_stata. For Stata export that retains value labels, cast the relevant columns to pd.Categorical (the value-label mapping gives the categories).
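
A minimal sketch of that Categorical round-trip, assuming a value-label dict shaped like the one autolabel stores (the data and file path are illustrative):

```python
import pandas as pd

# Sketch of the Categorical trick suggested above: turn the stored
# value-label mapping into labeled categories before exporting, so the
# .dta file keeps value labels.
df = pd.DataFrame({"kon": [1, 2, 1]})
value_labels = {"kon": {1: "Man", 2: "Woman"}}    # as in df.attrs['registream']

export = df.copy()
for col, mapping in value_labels.items():
    export[col] = pd.Categorical(
        export[col].map(mapping),                 # codes → label strings
        categories=list(mapping.values()),        # fixed, ordered category set
    )

export["kon"].tolist()             # ['Man', 'Woman', 'Man']
# export.to_stata("labeled.dta")   # categories become Stata value labels
```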

Read the full dict directly, or use the accessor getters for a fresh copy:

df.rs.variable_labels()        # {"kon": "Sex", ...}
df.rs.value_labels()           # {"kon": {1: "Man", 2: "Woman"}, ...}
df.get_variable_labels("kon")  # "Sex"
df.get_variable_labels(["kon", "alder"])  # filtered dict

stamp_registream_attrs(df, domain, lang, scope, release, scope_depth) is the function that writes the provenance fields; autolabel() calls it internally. Import it from registream.autolabel if you're hand-building a labeled DataFrame (rare).

Return-type dataclasses

Four small dataclasses wrap rich results so you keep attribute access (.missing, .pin_command) without losing Jupyter-friendly rendering.

SuggestResult — returned by suggest()

result.coverage       # DataFrame: scope_level_1..N, matches, coverage_pct, is_primary
result.primary        # ScopeInference | None — the inferred primary (≥10% coverage)
result.pin_command    # str — copy-pasteable df.autolabel(...) call

Renders in Jupyter as a header line + the coverage table; plain repr() in a REPL.

LookupResult — returned by df.lookup()

result.df             # DataFrame: full metadata rows (one per match in default mode)
result.missing        # list[str] — variable names requested but not found
result.scope_counts   # dict[str, int] — how many distinct scopes each variable appears in

Jupyter renders a curated column subset (variable_name, variable_label, variable_type, variable_unit, value_labels, scope_level_*, release) plus the missing list. Use .df for all columns.

DryRunResult — returned when autolabel(dryrun=True)

plan = df.autolabel(dryrun=True)

plan.variable_labels    # {column: label} — what would be stamped
plan.value_labels       # {column: {code: label}}
plan.skipped_vars       # list — columns that matched but had an empty label
plan.resolved_scope     # tuple | None — the scope that would be recorded
plan.resolved_release   # str | None

df.attrs is untouched. Apply the plan by calling autolabel() again without dryrun=True.

LabeledView — returned by df.lab

df.lab               # wraps df without mutating it
df.lab.head()        # first 5 rows, value codes substituted
df.lab.tail()
df.lab.sample(10)
df.lab.as_dataframe()  # extract as a regular DataFrame
df.lab[["kon", "alder"]]  # labeled columns (Series named after variable labels)

Substitution uses df.attrs['registream']['value_labels']. Codes without a mapping pass through unchanged. The wrapped DataFrame and its columns are not modified — use a LabeledView for display; use the original df for math.
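
In plain pandas, the substitution semantics amount to a code-to-label map with passthrough for unmapped codes (data here is illustrative, not the view's implementation):

```python
import pandas as pd

# What the substitution amounts to in plain pandas: map codes through the
# stored value-label dict; unmapped codes (9 here) pass through unchanged.
df = pd.DataFrame({"kon": [1, 2, 9]})
kon_labels = {1: "Man", 2: "Woman"}   # df.attrs['registream']['value_labels']['kon']

shown = df["kon"].map(lambda code: kon_labels.get(code, code))
shown.tolist()   # ['Man', 'Woman', 9]
```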

DownloadResult — returned by update_datasets()

result.domain   # "scb"
result.lang     # "eng"
result.files    # list[str] — filenames written
result.skipped  # list[str] — already-cached keys
result.failed   # list[tuple[str, str]] — (key, reason) per failure

Examples — Basic labeling

df.autolabel()

Label all columns using SCB metadata in English (domain/lang defaults). Scope auto-inferred.

df.autolabel(domain="scb", lang="swe", label_type="values")

Apply value labels only, in Swedish.

df.autolabel(variables=["kon", "alder", "yrkarbtyp"])

Label a specific subset of columns. Names not in df are silently skipped.

df.autolabel(exclude=["temp_col", "debug_flag"])

Label every matching column except a few. Applied after variables.

df.autolabel(include_unit=False)

Don't append unit suffixes like " (years)" — use the raw variable_label text.

Examples — Scope-specific

df.autolabel(scope=["LISA"], release="2005")

Apply labels specific to LISA for the 2005 release. For instance, kon receives "Gender" (from LISA), not "Gender of child" (from Barnregistret).

df.autolabel(scope=["LISA", "Individer 16 år och äldre"])

Apply labels from a specific (scope_level_1, scope_level_2) combination.

df.autolabel(scope=["LISA", "Individer 16 år och äldre", "2021"])

Overflow shorthand: three tokens when scope_depth=2; the last becomes the release.

Examples — Lookup

result = df.lookup(["kon", "kommun"])
result.df           # metadata rows for kon, kommun across all scopes
result.missing      # [] — nothing missing
result.scope_counts # {"kon": 5, "kommun": 3}

Display metadata for kon and kommun. Jupyter renders the curated table + missing-variables list.

df.lookup("kon", detail=True)

Show every scope level and release entry for kon, with scope_level_1..N and release columns attached.

df.lookup("kon", scope=["LISA", "Individer 16 år och äldre"])

Scope-filtered lookup — only rows in that atom.

Examples — Browse the catalog

from registream.autolabel import scope

scope()

Browse top-level scopes; one row per distinct scope_level_1 with a variable_count column.

scope(scope=["LISA"])

Drill into LISA — one row per sub-scope (variant) with variable counts.

scope(scope=["LISA", "Individer 16 år och äldre"])

Show releases for that specific scope atom.

scope(scope=["LISA", "Individer 16 år och äldre"], release="2021")

Show variables in a specific (scope, release) atom.

scope(search="lisa")

Filtered level-1 browse — only rows whose scope_level_1 matches "lisa" (case- and apostrophe-insensitive).

Examples — Dry-run + suggest()

from registream.autolabel import suggest

result = suggest(df)
print(result.pin_command)       # copy-paste-ready df.autolabel(...) call
result.coverage.head()          # per-scope hit counts

Run suggest() first on a mixed-panel dataset to see which scope would dominate under automatic mode and grab a reproducible pin command.

plan = df.autolabel(dryrun=True)
len(plan.variable_labels), len(plan.value_labels), plan.resolved_scope

dryrun=True returns a DryRunResult: same resolution logic as a real call, but df.attrs is untouched. Useful for auditing which columns would get labels before committing.

Examples — Display-time with df.lab

After labeling, the raw integer / string codes are preserved on each column. For display — summary tables, quick inspection, plots — use df.lab for a labeled view:

df.autolabel()
df.lab.head()

One-line labeled view. Value codes substituted; the original df is unchanged.

(
    df.lab
      .sample(1000)
      .groupby("kon")["alder"].mean()
)

Group by labeled categories without mutating the underlying DataFrame.

df.lab.as_dataframe()    # extract as a regular pandas DataFrame with substituted values

Materialize the labeled view when you need a plain DataFrame for an API that doesn't accept the wrapper.

Examples — Dataset updates

from registream.autolabel import update_datasets

update_datasets("scb", "eng")

Check for and download the latest SCB English metadata bundle.

update_datasets("scb", "eng", force=True)

Force re-download even if the local cache is current. Useful after an integrity check has flagged a corrupt bundle.

from registream.autolabel import check_for_dataset_updates

msg = check_for_dataset_updates("scb", "eng")
print(msg or "up to date")

Non-blocking "is there an update available?" check — returns a short message string when an update is pending, or an empty string otherwise.

Jupyter, matplotlib, seaborn

autolabel ships three integrations that light up automatically on import. Every integration is idempotent and has an opt-out environment variable; every integration is conditional on df.attrs['registream'] being present, so unlabeled DataFrames are completely untouched.

Jupyter — rich _repr_html_ on every result type

SuggestResult, LookupResult, and the underlying DataFrame all render as proper HTML tables in Jupyter. No extra setup required — if pandas renders as a table in your notebook, so will these.

Seaborn — label-aware plot wrappers

On import, autolabel wraps 16 seaborn plotting functions (when seaborn is installed): scatterplot, lineplot, barplot, boxplot, violinplot, stripplot, swarmplot, countplot, histplot, kdeplot, ecdfplot, heatmap, relplot, lmplot, regplot, residplot. When you pass a labeled DataFrame (or df.lab) via data=, the wrapper:

  • Substitutes value labels on categorical axes (skipping the x-axis on scatterplot / lineplot to preserve numeric scales);
  • Applies variable labels to axis titles and legend titles;
  • Leaves the underlying data untouched.

import seaborn as sns
import registream.autolabel  # wraps sns functions on import

df.autolabel(domain="scb", lang="eng")

sns.barplot(data=df, x="kon", y="alder")
# x-axis ticks: "Man", "Woman"
# x label:     "Sex"
# y label:     "Age (years)"

Opt out with REGISTREAM_NO_PLOT_PATCH=1 before importing. The patch is a no-op when seaborn is not installed.

Pandas patches — labels follow columns through __setitem__ and rename

autolabel installs two pandas method patches so labels survive the most common column-level operations:

  • df["new"] = df["old"] copies the label entry for old onto new.
  • df.rename(columns={"old": "new"}) remaps the label dict keys accordingly.

Both patches are conditional on df.attrs['registream'] being present — unlabeled DataFrames are left exactly as pandas would render them. Opt out with REGISTREAM_NO_PANDAS_PATCH=1 before importing.

df.autolabel()
df["age_years"] = df["alder"]          # label "Age (years)" follows
df = df.rename(columns={"kon": "sex"}) # label dict remapped to "sex"

All REGISTREAM_NO_* opt-outs read their environment variable at import time. Set the flag before import registream.autolabel runs; toggling after import has no effect. To restore default behavior, restart the Python process.

Institutional metadata

Any institution can create metadata for use with autolabel. Useful for organizations with in-house registers, administrative records, or survey datasets that have standardized variable definitions.

Requirements

Create five semicolon-delimited CSV files following the autolabel schema v2:

  1. manifest_{lang}.csv — key-value manifest declaring domain metadata, scope depth, and per-level names
  2. scope_{lang}.csv — atomic scope-release rows with scope_level_1, scope_level_1_alias, etc.
  3. variables_{lang}.csv — variable names, labels, type, value_label_id, and release_set_id foreign keys
  4. value_labels_{lang}.csv — value label mappings in both JSON and Stata format
  5. release_sets_{lang}.csv — junction table linking release sets to scope atoms

See the schema v2 reference for the full specification.

Installation

Place the CSV files in ~/.registream/autolabel/{domain}/ (or the directory returned by registream.metadata.cache_dir()). No internet access is required — files are read directly from disk.

df.autolabel(domain="yourdomain", lang="yourlang")
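
Before calling it, you can sanity-check the layout; a sketch assuming the default cache location (the domain name "mydomain" and the helper are made up):

```python
from pathlib import Path

# Sanity-check sketch for a custom-domain install: the five schema-v2 files,
# named per language, under the domain's cache directory. Domain name and
# path are assumptions of this sketch.
def expected_files(domain_dir: Path, lang: str) -> list[Path]:
    stems = ["manifest", "scope", "variables", "value_labels", "release_sets"]
    return [domain_dir / f"{stem}_{lang}.csv" for stem in stems]

domain_dir = Path.home() / ".registream" / "autolabel" / "mydomain"
missing = [f.name for f in expected_files(domain_dir, "eng") if not f.exists()]
```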

Secure environments

For secure environments (MONA, DST Forskermaskinen, SSB Dapla, etc.): an authorized person copies the CSV files onto the secure server and sets REGISTREAM_DIR to the shared location. No runtime network access is needed. The Python client reads the same on-disk format as Stata and R, so a single cache works for all three.

Accessor — implementation note

The DataFrame methods shown above are thin forwarders to a pandas accessor registered under df.rs. This is the standard pandas extension mechanism (pd.api.extensions.register_dataframe_accessor). Both forms resolve to the same code path:

df.autolabel(...)   # method form — documented surface
df.rs.autolabel(...)  # accessor form — same call, same result

The accessor namespace (df.rs.autolabel, df.rs.lookup, df.rs.lab, df.rs.variable_labels(), df.rs.value_labels()) exists for two reasons: (1) it's the canonical pandas extension pattern so downstream tools that introspect DataFrames can discover our methods without relying on name-specific monkey-patches; (2) it gives library authors who opt out of the method form a deterministic namespace to target.
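
The mechanism itself is standard pandas and can be shown with a toy accessor (the name "demo" is invented to avoid clashing with the real df.rs, which carries the full method surface):

```python
import pandas as pd

# Toy illustration of pd.api.extensions.register_dataframe_accessor:
# a class registered under a name becomes available as df.<name>.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def variable_labels(self) -> dict:
        # read labels from the df.attrs storage convention documented above
        return dict(self._df.attrs.get("registream", {}).get("variable_labels", {}))

df = pd.DataFrame({"kon": [1, 2]})
df.attrs["registream"] = {"variable_labels": {"kon": "Sex"}}
df.demo.variable_labels()   # {'kon': 'Sex'}
```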

Library authors: set REGISTREAM_NO_SHORTCUTS=1 before importing registream.autolabel to skip the monkey-patch. The accessor (df.rs.*) and the module-level functions (autolabel(df, ...), suggest(df), …) remain fully available. End-user documentation uses the method form throughout; this opt-out exists so library code doesn't surprise users who never imported us directly.

See also

  • Schema v2 reference — the data format autolabel reads
  • Stata reference — the reference implementation
  • R reference — the R port (haven_labelled storage)
  • Catalog — available domains, bundle downloads, attribution
  • Install guide — secure environments, institutional setup, cache directory sharing
  • Citation — or call registream.autolabel.cite() for a version-pinned block

Authors

Jeffrey Clark

PhD Student, Economics

Stockholm University

Jie Wen

PhD Student, Business Administration

Stockholm School of Economics