autolabel R reference

Install

Requires R 4.1 or later. The haven package is installed automatically as a dependency.

From registream.org (current, pre-CRAN)

install.packages(
  c("registream", "autolabel"),
  repos = c("https://registream.org/r/",
            "https://cloud.r-project.org/"),
  type = "source"
)

The second repos entry is the fallback for transitive dependencies (curl, digest, jsonlite, haven), which resolve from CRAN as usual. The install is source-only; both packages are pure R with no compiled code.

CRAN (coming)

CRAN submission is the next step. Once accepted:

install.packages(c("registream", "autolabel"))

Full install notes (secure environments, institutional domains, cache directory override, and the first-run wizard) are in the install guide.

Quick start

library(autolabel)
library(haven)

df <- read_dta("lisa_2020.dta")

# Apply variable and value labels from SCB metadata (English), auto-infer scope
df <- df |> autolabel(domain = "scb", lang = "eng")

# Display-time view (factor columns for quick inspection / summary)
df |> rs_lab() |> head()

Labels land on each column as haven_labelled attributes so they round-trip through write_dta() and play nicely with the tidyverse. See Labeling rules for what happens under the hood when no scope is pinned.

Public API

Labeling

autolabel(df,
          domain       = "scb",
          lang         = "eng",
          scope        = NULL,
          release      = NULL,
          label_type   = c("both", "variables", "values"),
          variables    = NULL,
          exclude      = NULL,
          include_unit = TRUE,
          dryrun       = FALSE,
          directory    = NULL)

S3 method on data.frame. Tibbles dispatch transparently since they inherit from data.frame. Returns the labelled data frame. Pass exclude = c("var1", "var2") to skip specific columns. Pass dryrun = TRUE to return an autolabel_dryrun object describing the planned changes without mutating the data frame.

Inspection

rs_lookup(variables,
          domain    = "scb",
          lang      = "eng",
          scope     = NULL,
          release   = NULL,
          detail    = FALSE,
          directory = NULL)

scope(...,
      domain    = "scb",
      lang      = "eng",
      search    = NULL,
      release   = NULL,
      directory = NULL)

suggest(df, ...,
        domain    = "scb",
        lang      = "eng",
        release   = NULL,
        directory = NULL)

scope() and suggest() take scope-level tokens positionally via ..., matching Stata's autolabel scope LISA Individer syntax. A bare vector works equivalently: scope(c("LISA", "Individer")) and scope("LISA", "Individer") produce the same output.

Display

rs_lab(x)             # haven::as_factor(x) wrapper for full df or single column
rs_lab_head(x, n = 5) # head(rs_lab(x), n)

Maintenance

rs_update_datasets(domain, lang, version = "latest", force = FALSE, directory = NULL)

info()         # config + cache snapshot
cite()         # full citation block (matches `autolabel cite` in Stata)
cite_bibtex()  # versioned BibTeX entry

Core package (registream)

rs_info()                   # config + environment snapshot
rs_first_run(force = FALSE) # run the setup wizard manually
rs_update()                 # prints / runs install.packages() for the latest version (interactive)
rs_stats()                  # local usage statistics (from usage_r.csv)
rs_cite()                   # full citation block (matches `registream cite` in Stata)

Description

autolabel applies variable and value labels from the RegiStream metadata catalog to R data frames. On first use, it downloads and caches a 5-file metadata bundle (manifest + variables + value_labels + scope + release_sets) from registream.org, matches your columns against the catalog, and attaches labels as column attributes.

RegiStream hosts metadata for government statistical agencies including Statistics Sweden (scb), Statistics Denmark (dst), Statistics Norway (ssb), Försäkringskassan (fk), Socialstyrelsen (sos), and Statistics Iceland (hagstofa). Institutions can create metadata files for their own data — see Institutional metadata.

Autolabel schema v2 is depth-agnostic: each domain's manifest declares the scope depth and per-level names. For SCB the two scope levels are Register (e.g. LISA) and Variant (e.g. Individer 16 år och äldre). For SSB the levels are Source and Group. Every variable can appear in multiple scopes with different labels, value definitions, and releases; when no scope is specified, autolabel automatically infers the best-matching scope by analyzing the columns in your data frame.

First-run setup: the first call into any autolabel / registream function triggers a short setup wizard asking you to choose a mode (Offline, Standard, or Full). The mode governs whether metadata downloads automatically and whether usage data is collected. To change it later, edit ~/.registream/config_r.toml; use rs_info() to inspect the current configuration. Non-interactive sessions silently pick Offline mode, so R CMD check never hits the network.

Arguments — Required

`df`

A data.frame or tibble whose columns will receive labels. Non-matching columns are left unchanged (pre-existing labels preserved).

`domain`

The metadata domain. Default "scb". Other shipped domains: "dst", "ssb", "hagstofa", "fk", "sos". Institutions can register custom domains; see Institutional metadata.

`lang`

The language for labels. Default "eng". Availability varies per domain — for "scb": "eng" or "swe".

Arguments — Filtering

`scope`

A character vector of scope-level tokens. One token per scope level of the domain's manifest. For SCB (depth 2):

scope = "LISA" — matches all sub-scopes under LISA
scope = c("LISA", "Individer 16 år och äldre") — matches that specific scope atom

Matching per level uses a 3-step ladder, case- and apostrophe-insensitive: (1) exact alias match, (2) exact name match, (3) name substring match.
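
The ladder can be illustrated with a small base-R sketch. This is not the package's implementation: match_scope_token() and normalise_token() are hypothetical helpers, the alias values are made up for illustration, and only the ASCII apostrophe is stripped here.

```r
# Hypothetical helper: lower-case and drop apostrophes before comparing.
normalise_token <- function(x) {
  gsub("'", "", tolower(x), fixed = TRUE)
}

# Hypothetical sketch of the 3-step ladder for a single scope level.
# `names` and `aliases` are parallel character vectors of candidate values.
match_scope_token <- function(token, names, aliases) {
  tok <- normalise_token(token)
  nm  <- normalise_token(names)
  al  <- normalise_token(aliases)

  hit <- which(al == tok)                          # (1) exact alias match
  if (length(hit) == 0) hit <- which(nm == tok)    # (2) exact name match
  if (length(hit) == 0) {
    hit <- which(grepl(tok, nm, fixed = TRUE))     # (3) name substring match
  }
  names[hit]
}

# Toy level-1 values; the aliases are illustrative assumptions.
level1_names   <- c("LISA", "Barnregistret")
level1_aliases <- c("lisa", "barn")

match_scope_token("barn",   level1_names, level1_aliases)  # resolved at step 1
match_scope_token("regist", level1_names, level1_aliases)  # resolved at step 3
```

Each step runs only when the previous one found nothing, so an exact alias or name always wins over a substring hit.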

Overflow shorthand: pass scope_depth + 1 tokens and the last one is promoted to release. For SCB: scope = c("LISA", "Individer 16 år och äldre", "2021") is equivalent to scope = c("LISA", "Individer 16 år och äldre"), release = "2021". This matches Stata's overflow rule and keeps pin commands terse.
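
A minimal sketch of the overflow rule, assuming a hypothetical helper split_scope_overflow(). The error raised when release is supplied twice is an assumption for illustration; the text above does not say how the package handles that case.

```r
# Hypothetical helper: promote the (scope_depth + 1)-th token to `release`.
split_scope_overflow <- function(scope, scope_depth, release = NULL) {
  if (length(scope) == scope_depth + 1L) {
    if (!is.null(release)) {
      # Assumed behaviour: refuse an ambiguous double specification.
      stop("release given both as an overflow token and as an argument")
    }
    release <- scope[scope_depth + 1L]
    scope   <- scope[seq_len(scope_depth)]
  } else if (length(scope) > scope_depth + 1L) {
    stop("too many scope tokens for this domain")
  }
  list(scope = scope, release = release)
}

# Depth 2, three tokens: the last one becomes the release.
pin <- split_scope_overflow(c("LISA", "Individer", "2021"), scope_depth = 2L)
pin$scope    # "LISA" "Individer"
pin$release  # "2021"
```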

When omitted, autolabel infers the best-matching scope by counting dataset-variable overlap per scope and picking the scope with the most matches. An overlap of at least 10% triggers a primary-scope preference in the collapse; below that, the dataset is treated as mixed-panel.
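
The overlap count can be sketched in base R as follows. infer_scope_sketch() is a hypothetical stand-in, not the package's infer_scope(); the toy catalog has a single scope level, and the 10% share is taken over the data frame's columns, which is an assumption about the denominator.

```r
# Toy catalog: one row per (scope, variable). Contents are illustrative only.
catalog <- data.frame(
  scope         = c("LISA", "LISA", "LISA", "Barnregistret"),
  variable_name = c("kon", "alder", "kommun", "kon"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch: count distinct column-name matches per scope and
# pick the scope with the most matches.
infer_scope_sketch <- function(col_names, catalog, threshold = 0.10) {
  cols <- unique(tolower(col_names))
  hits <- catalog[tolower(catalog$variable_name) %in% cols, ]
  hits <- unique(hits[, c("scope", "variable_name")])
  counts <- table(hits$scope)
  if (length(counts) == 0) return(NULL)
  # Most matches first; ties broken by scope name ascending.
  counts  <- counts[order(-as.integer(counts), names(counts))]
  matches <- as.integer(counts[1])
  list(
    scope              = names(counts)[1],
    matches            = matches,
    coverage           = matches / length(cols),
    has_strong_primary = matches / length(cols) >= threshold
  )
}

res <- infer_scope_sketch(c("lopnr", "Kon", "alder", "kommun"), catalog)
res$scope     # "LISA"
res$coverage  # 0.75
```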

`release`

A single release identifier (typically a year, e.g. "2021"). Filters scope rows to that release before labelling — useful when value-label sets change over time (municipality codes, education classifications).

Arguments — Output

`label_type`

One of "both" (default), "variables", or "values". Controls whether variable labels, value labels, or both are applied.

`variables`

Optional character vector restricting which columns receive labels. Names not in df are silently skipped. Useful for labelling a subset of columns when multiple scopes are involved.

`include_unit`

If TRUE (default), append " (unit)" to variable labels when variable_unit is non-empty in the metadata. Matches the Stata behavior.
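
As an illustration of the suffix rule, here is a base-R sketch with a hypothetical helper with_unit(). The label and unit values are toy data, and treating NA units as empty is an assumption.

```r
# Hypothetical helper mirroring the include_unit rule described above.
with_unit <- function(variable_label, variable_unit, include_unit = TRUE) {
  # Assumption: NA and "" both count as "no unit".
  has_unit <- !is.na(variable_unit) & nzchar(variable_unit)
  ifelse(include_unit & has_unit,
         paste0(variable_label, " (", variable_unit, ")"),
         variable_label)
}

with_unit(c("Age", "Gender"), c("years", ""))
# "Age (years)" "Gender"
with_unit(c("Age", "Gender"), c("years", ""), include_unit = FALSE)
# "Age" "Gender"
```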

`directory`

Override the RegiStream cache directory. When omitted, the path resolves from the REGISTREAM_DIR environment variable, then the cache_dir field in config_r.toml, then tools::R_user_dir("registream", "cache"). Set REGISTREAM_DIR to share a cache with the Stata and Python clients on the same machine.
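
The lookup order can be sketched like this. resolve_cache_dir() is a hypothetical helper, and config_cache_dir stands in for the cache_dir value read from config_r.toml (the TOML parsing is not shown); only Sys.getenv() and tools::R_user_dir() are real base-R calls.

```r
# Hypothetical sketch of the cache-directory lookup order.
resolve_cache_dir <- function(directory = NULL, config_cache_dir = NULL) {
  if (!is.null(directory)) return(directory)        # explicit argument wins
  env <- Sys.getenv("REGISTREAM_DIR", unset = "")
  if (nzchar(env)) return(env)                      # then the environment variable
  if (!is.null(config_cache_dir) && nzchar(config_cache_dir)) {
    return(config_cache_dir)                        # then the config file value
  }
  tools::R_user_dir("registream", which = "cache")  # finally the R default
}

# With nothing else set, this falls through to the per-user R cache path.
resolve_cache_dir()
```

Because the environment variable is checked before the config file, exporting REGISTREAM_DIR once is enough to point every client on the machine at the same cache.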

Labeling rules

When you run autolabel(df, ...), autolabel has to decide which metadata row to use for each column. The rules are the same as in the Stata client, for both the automatic and the explicit-pin paths.

Automatic mode (no pin)

Automatic mode is what you get when you call autolabel(df, domain = "scb", lang = "eng") with no scope or release argument. Two steps:

Step 1: Primary scope inference. For each scope tuple in the bundle, infer_scope() counts distinct variable-name matches against colnames(df) (case-insensitive), ranks by match count, then coverage, then scope tuple ascending, and reports the winner:
```
Auto-detected scope: LISA / Individer 16 år och äldre (91% variable match)
```
When the top scope covers fewer than 10% of columns, has_strong_primary is FALSE and the dataset is treated as mixed-panel with no dominant source.
Step 2: Per-variable collapse with majority fallback. collapse_to_one_per_variable() joins variables through release_sets to scope, computes _label_freq (count of rows sharing the same (variable_name, variable_label)), and sorts per variable on:
1. _is_primary descending (primary-scope rows first)
2. _label_freq descending (most-common label next)
3. scope_level_1 .. N ascending (deterministic tie-break)
Dedup on variable_name keeps the first row. Every column with any metadata entry gets a label; the primary-scope preference is a sort-key bias, not a filter. Columns not in the primary scope fall through to the majority label across their candidate scopes.
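
The sort-and-dedup rule above can be sketched in base R. collapse_sketch() is a hypothetical stand-in for collapse_to_one_per_variable(): it assumes the join has already happened, uses a single scope level, and the rows (including the scope names ScopeA to ScopeC and the cfarnr labels) are toy data.

```r
# Toy candidate rows after the join; values are illustrative only.
rows <- data.frame(
  variable_name  = c("kon", "kon", "cfarnr", "cfarnr", "cfarnr"),
  variable_label = c("Gender of child", "Gender", "Label A", "Label B", "Label B"),
  scope_level_1  = c("Barnregistret", "LISA", "ScopeA", "ScopeB", "ScopeC"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch of the per-variable collapse.
collapse_sketch <- function(rows, primary_scope) {
  rows$is_primary <- rows$scope_level_1 == primary_scope
  # Number of rows sharing the same (variable_name, variable_label) pair.
  rows$label_freq <- ave(
    rep(1L, nrow(rows)),
    rows$variable_name, rows$variable_label,
    FUN = sum
  )
  ord <- order(
    rows$variable_name,
    -rows$is_primary,    # primary-scope rows first
    -rows$label_freq,    # then the most common label
    rows$scope_level_1   # deterministic tie-break
  )
  rows <- rows[ord, ]
  # Keep the first row per variable.
  rows[!duplicated(rows$variable_name), c("variable_name", "variable_label")]
}

collapsed <- collapse_sketch(rows, primary_scope = "LISA")
collapsed
```

In this toy run kon takes its label from the primary scope, while cfarnr has no primary-scope row and falls through to its most frequent label.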

Explicit-pin mode

When you pass scope (and optionally release), autolabel skips inference and narrows the bundle to the pinned subset before collapse. Rows outside the pin are never considered.

df |> autolabel(scope = c("LISA", "Individer 16 år och äldre"),
                release = "2021")

Label-wipe guard: columns in df that have no row in the pinned scope are skipped — their pre-existing attributes are preserved. Chain multiple explicit-pin calls for multi-scope panels without clobbering earlier labels.
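
The guard amounts to touching only the columns that have a metadata row. A minimal base-R sketch, with a hypothetical apply_labels_sketch() and toy labels:

```r
# Hypothetical sketch: only columns with a metadata row are modified.
apply_labels_sketch <- function(df, labels) {
  for (nm in intersect(names(df), names(labels))) {
    attr(df[[nm]], "label") <- labels[[nm]]
  }
  df
}

df_demo <- data.frame(kon = c(1L, 2L), cfarnr = c(10L, 20L))
attr(df_demo$cfarnr, "label") <- "Existing label"  # e.g. set by an earlier call

# The "pinned scope" here only knows about kon.
df_demo <- apply_labels_sketch(df_demo, c(kon = "Gender"))
attr(df_demo$kon, "label")     # "Gender"
attr(df_demo$cfarnr, "label")  # "Existing label"
```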

Multi-scope panels

Two options when your panel mixes variables from multiple scopes:

Option A — automatic (simple): one call, no scope arg. Primary scope is inferred; non-primary columns fall through to the majority-label rule. One line, everything gets labeled.

Option B — explicit-pin per subset (reproducible): run suggest() first to preview coverage per scope, then pin each subset:

df |> autolabel(
        variables = c("lopnr", "kon", "alder", "kommun"),
        scope     = c("LISA", "Individer 16 år och äldre")
      ) |>
      autolabel(
        variables = c("cfarnr", "bransch", "anstallda"),
        scope     = "Företagsregister"
      )

The label-wipe guard makes this safe: the second call only touches the columns it has metadata for; the first call's labels on LISA-only variables survive.

scope() browser

scope() is the depth-agnostic catalog browser — the R analogue of Stata's autolabel scope subcommand. Returns a plain data frame you can chain with base R, dplyr, gt, or anything else.

Four modes

scope()                                    # level-1 browse with variable_count
scope("LISA")                              # drill to level 2 under LISA
scope("LISA", "Individer 16 år och äldre")         # releases at this scope atom
scope("LISA", "Individer 16 år och äldre", "2021") # variables at this (scope, release) atom
scope(search = "lisa")                     # filtered level-1 browse

The overflow-token shorthand (mode 4) is the terse way to list variables at an atom without writing release = explicitly. The last call, scope(search = "lisa"), is mode 1 with a name filter rather than a fifth mode.

Filtered drill

scope("LISA", search = "individer")

Narrows the level-2 values under LISA to those whose name contains "individer" (case- and apostrophe-insensitive).

suggest()

suggest(df, ...) analyzes the columns of df against the domain metadata and reports which scopes would contribute labels under automatic mode — without mutating df. Recommended first step for mixed-panel workflows.

result <- df |> suggest(domain = "scb", lang = "eng")
print(result)
result$coverage          # data frame: scope_level_1..N, matches, coverage_pct, is_primary
result$primary           # rs_scope_inference or NULL
result$pin_command       # copy-pasteable autolabel(...) call

The coverage table shows per-scope hit counts sorted descending. The inferred primary (when match ≥ 10%) is marked is_primary = TRUE. The pin_command is a copy-paste-ready autolabel(df, scope = ..., ...) call that would reproduce the inferred pin.

Scope-specific preview

df |> suggest("LISA")
df |> suggest("LISA", "Individer 16 år och äldre")

Same ... token syntax as scope(). Pinning narrows the coverage calculation to the requested atom, so you see exactly which columns would be labelled under that scope.

Under the hood. suggest() runs the same filter_bundle() and inference that autolabel() uses — so what you see is what you'll get.

haven_labelled storage

R has native column attributes, and autolabel stores labels in them directly: no df-level label blob and no dependency on the labelled package.

Variable labels: attr(col, "label") — a single string per column.
Value labels: attr(col, "labels") — a named vector (names = labels, values = codes).
Class wrapper: class(col) <- c("haven_labelled", "vctrs_vctr", typeof(col)) on columns with value labels, via haven::labelled().

This is the canonical haven convention. The underlying data is never modified:

df <- haven::read_dta("lisa_2020.dta") |> autolabel()

# Inspect attributes: the integer column itself is unchanged
attr(df$kon, "label")    # "Gender"
attr(df$kon, "labels")   # c(Man = 1L, Kvinna = 2L)
typeof(df$kon)           # "integer"

# Strip labels back to a plain integer
vec <- haven::zap_labels(df$kon)
attr(vec, "label")       # NULL

After labelling, the data frame also carries a small attr(df, "registream") list recording domain, lang, scope, release, scope_depth, and schema_version for introspection by info() and suggest(). This is the only df-level attribute; column attrs remain the source of truth for labels.

Examples — Basic labeling

df |> autolabel()

Label all columns using SCB metadata in English (domain/lang defaults). Scope auto-inferred.

df |> autolabel(domain = "scb", lang = "swe", label_type = "values")

Apply value labels only, in Swedish.

df |> autolabel(variables = c("kon", "alder", "yrkarbtyp"))

Label a specific subset of columns.

df |> autolabel(include_unit = FALSE)

Don't append unit suffixes like " (years)" — use the raw variable_label text.

Examples — Scope-specific

df |> autolabel(scope = "LISA", release = "2005")

Apply labels specific to LISA for the 2005 release. For instance, kon receives "Gender" (from LISA), not "Gender of child" (from Barnregistret).

df |> autolabel(scope = c("LISA", "Individer 16 år och äldre"))

Apply labels from a specific (scope_level_1, scope_level_2) combination.

df |> autolabel(scope = c("LISA", "Individer 16 år och äldre", "2021"))

Overflow shorthand: three tokens when scope_depth = 2; the last becomes the release.

Examples — Lookup

rs_lookup(c("kon", "kommun"))

Display metadata for kon and kommun across all scopes.

rs_lookup("kon", detail = TRUE)

Show every scope level and release entry for kon, with scope_level_1..N and release columns attached.

rs_lookup("kon", scope = c("LISA", "Individer 16 år och äldre"))

Scope-filtered lookup — only rows in that atom.

Examples — Browse the catalog

scope()

Browse top-level scopes; one row per distinct scope_level_1 with a variable_count column.

scope("LISA")

Drill into LISA — one row per sub-scope (variant) with variable counts.

scope("LISA", "Individer 16 år och äldre")

Show releases for that specific scope atom.

scope("LISA", "Individer 16 år och äldre", release = "2021")

Show variables in a specific (scope, release) atom.

scope(search = "lisa")

Filtered level-1 browse — only rows whose scope_level_1 matches "lisa" (case- and apostrophe-insensitive).

Examples — Display-time factor labels

After labelling, the raw integer / character codes are preserved on each column. For display — summary tables, quick inspection, ggplot — convert to factors with rs_lab():

df |> autolabel() |> rs_lab() |> head()

One-line labelled view. rs_lab() is a thin wrapper around haven::as_factor() applied across all haven_labelled columns.
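
To show what the conversion does to a single column, here is a base-R approximation. labelled_to_factor() is a hypothetical helper, not haven::as_factor() itself: in this sketch, codes without a label become NA, which may differ from haven's handling.

```r
# Hypothetical base-R approximation of turning labelled codes into a factor.
labelled_to_factor <- function(x) {
  labs <- attr(x, "labels")
  if (is.null(labs)) return(x)  # no value labels: leave the column alone
  factor(as.vector(unclass(x)), levels = unname(labs), labels = names(labs))
}

# A column carrying value labels as a plain attribute.
kon <- structure(c(1L, 2L, 2L), labels = c(Man = 1L, Kvinna = 2L))

f <- labelled_to_factor(kon)
levels(f)        # "Man" "Kvinna"
as.character(f)  # "Man" "Kvinna" "Kvinna"
```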

df |>
  autolabel() |>
  rs_lab() |>
  dplyr::count(kon, sort = TRUE)

Use the labelled factors downstream without losing the underlying codes on the original data frame.

Examples — Dataset updates

rs_update_datasets("scb", "eng")

Check for and download the latest SCB English metadata bundle.

rs_update_datasets("scb", "eng", force = TRUE)

Force re-download even if the local cache is current. Useful after an integrity check has flagged a corrupt bundle.

check_for_dataset_updates("scb", "eng")

Non-blocking "is there an update available?" check — returns a short message string and updates the registry's last_checked timestamp.

tidyverse interop

autolabel's output plays well with the tidyverse because it uses the same haven_labelled storage haven itself produces when reading a DTA:

dplyr: filter / mutate / summarise operate on the underlying codes; column attributes ride along.
gtsummary: reads attr(col, "label") for variable labels and attr(col, "labels") for value labels out of the box.
modelsummary, broom.helpers, ggstats: same story — any package that understands haven_labelled will pick up autolabel's output.
haven::write_dta: round-trips losslessly — write an autolabelled data frame to DTA and it reads back with the same labels in Stata.

library(dplyr)
library(gtsummary)

df |>
  autolabel(scope = c("LISA", "Individer 16 år och äldre")) |>
  select(kon, alder, kommun) |>
  tbl_summary()  # picks up autolabel's labels automatically

Institutional metadata

Any institution can create metadata for use with autolabel. Useful for organizations with in-house registers, administrative records, or survey datasets that have standardized variable definitions.

Requirements

Create five semicolon-delimited CSV files following the autolabel schema v2:

manifest_{lang}.csv — key-value manifest declaring domain metadata, scope depth, and per-level names
scope_{lang}.csv — atomic scope-release rows with scope_level_1, scope_level_1_alias, etc.
variables_{lang}.csv — variable names, labels, type, value_label_id, and release_set_id foreign keys
value_labels_{lang}.csv — value label mappings in both JSON and Stata format
release_sets_{lang}.csv — junction table linking release sets to scope atoms

See the schema v2 reference for the full specification.
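
As a rough illustration of the file format, the sketch below writes and re-reads a semicolon-delimited variables file in base R. Only the delimiter and the column names variable_name, variable_label, value_label_id, and release_set_id come from this page; the row values, the quoting style, and the reduced column set are assumptions, so check the schema v2 reference before producing real files.

```r
# Toy variables table; row values are illustrative only.
variables <- data.frame(
  variable_name  = c("kon", "alder"),
  variable_label = c("Gender", "Age"),
  value_label_id = c("1", ""),
  release_set_id = c("1", "1"),
  stringsAsFactors = FALSE
)

# Write with ';' as the field separator (written to a temp dir for the demo).
path <- file.path(tempdir(), "variables_eng.csv")
write.table(variables, path, sep = ";", row.names = FALSE, quote = TRUE)

# Read everything back as character so IDs keep their exact text.
back <- read.csv(path, sep = ";", colClasses = "character",
                 stringsAsFactors = FALSE)
names(back)
```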

Installation

Place the CSV files in ~/.registream/autolabel/{domain}/ (or the output of registream::autolabel_cache_dir()). No internet access is required — files are read directly from disk.

df |> autolabel(domain = "yourdomain", lang = "yourlang")

Secure environments

For secure environments (MONA, DST Forskermaskinen, SSB Dapla, etc.): an authorized person copies the CSV files onto the secure server and sets REGISTREAM_DIR to the shared location. No runtime network access is needed. The R client reads the same on-disk format as Stata and Python, so a single cache works for all three.

Authors

Jeffrey Clark

PhD Student, Economics

Stockholm University

Jie Wen

PhD Student, Business Administration

Stockholm School of Economics