Institutional setup — RegiStream


Who this is for

Individual researchers can skip this page and stick to the install overview. This page is for situations where RegiStream is being deployed for more than one researcher, or where the team has its own data with its own variable conventions:

A research group at a university department wants everyone on the same catalog version, same config, same conventions.
A hospital or administrative agency wants a private metadata domain for their internal registers, so researchers can run autolabel variables, domain(hospital) on in-house data.
A secure-env tenant (a project inside MONA / Forskermaskinen / Dapla) wants to pre-configure autolabel for all members of the project.

Private ≠ secret. "Private metadata domains" means the metadata files live on your disk rather than on registream.org — they don't need to be kept secret, they just don't need to be published. A university department might run a private domain for a year, then decide to publish it; the files are the same either way.

Sysadmin: read-only deployment for all users

The cleanest way to provision RegiStream for many users on a shared system: place the bundle tree in a read-only directory once, and point every user's autolabel client at it via the REGISTREAM_DIR environment variable.

1. Populate the shared directory once

A privileged user (admin / data steward) downloads the bundles and places them in a path readable by all target users:

# As admin, on a machine with internet
sudo mkdir -p /opt/registream/autolabel/scb
# Place the bundle tree at /opt/registream/autolabel/scb/{manifest,scope,variables,value_labels,release_sets}
sudo chmod -R a+rX,go-w /opt/registream  # everyone can read; only the owner (root) can write
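
One way to confirm the permissions took effect is to scan the tree for anything that group or other users can still write to. The sketch below is a hypothetical site-side helper (`writable_by_others` is not part of RegiStream) and assumes a POSIX filesystem where permission bits are meaningful.

```python
import stat
from pathlib import Path


def writable_by_others(root):
    """Return paths under root (root included) with a group- or other-write bit set."""
    root = Path(root)
    mask = stat.S_IWGRP | stat.S_IWOTH
    flagged = []
    for path in [root, *root.rglob("*")]:
        if path.is_symlink():
            continue  # a symlink's own mode bits say nothing about its target
        if path.stat().st_mode & mask:
            flagged.append(str(path))
    return sorted(flagged)
```

Calling `writable_by_others("/opt/registream")` as the admin should return an empty list once the tree is locked down.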

2. Set `REGISTREAM_DIR` globally

Add an env var so every user's autolabel client reads from the shared dir:

# /etc/profile.d/registream.sh  (Linux, system-wide; stock macOS does not read /etc/profile.d, so add the line to /etc/zprofile there)
export REGISTREAM_DIR=/opt/registream

For Windows, set REGISTREAM_DIR as a system environment variable.

3. Verify per-user

From any user account on the system:

# Stata
registream info
# Python
python -m registream info
# R
library(autolabel); info()

Each should report the cache directory as /opt/registream. Subsequent autolabel calls read bundles from the shared tree without writing anywhere.

Why this works for read-only. autolabel's normal flow tries to write a per-installation .salt file and a usage log on first use. Those writes target the user's config directory (always per-user), not the cache directory. The bundle tree itself is read-only by design once downloaded; the client never modifies it. So pointing REGISTREAM_DIR at a read-only shared path is safe.
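
The lookup described above can be mirrored in a small site-side check. This is a sketch only: it assumes the mac/linux default of ~/.registream when REGISTREAM_DIR is unset, and the function names are hypothetical. The real client's resolution logic may differ in detail.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return REGISTREAM_DIR if it is set and non-empty, else ~/.registream."""
    env = os.environ if env is None else env
    override = env.get("REGISTREAM_DIR", "").strip()
    return Path(override) if override else Path.home() / ".registream"


def is_read_only_for_me(path):
    """True when the directory exists and the current user cannot write to it."""
    return os.path.isdir(path) and not os.access(path, os.W_OK)
```

Run as an ordinary user, `is_read_only_for_me(resolve_cache_dir())` should be true on a correctly provisioned shared system.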

Updating the shared bundle

When a new catalog version ships, the admin replaces the contents of /opt/registream/autolabel/<domain>/ with the new bundle tree. No user-side action is needed; on next call autolabel sees the new bundle and uses it.

Drop a ZIP, skip the unzip step. autolabel also accepts the bundle in its raw catalog form: drop {domain}_{lang}_v{version}.zip directly into $REGISTREAM_DIR/autolabel/ and the first call unzips and processes it in place, no separate ingest step required. See Secure environments → Three ways to stage the bundle for the option list.
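
Before dropping a ZIP into the shared directory, it can help to check that its name follows the {domain}_{lang}_v{version}.zip pattern. The sketch below is a hypothetical pre-flight check, not the client's own matcher; it assumes that domain and language codes contain no underscores.

```python
import re

# {domain}_{lang}_v{version}.zip, for example scb_eng_v20260309.zip
BUNDLE_NAME = re.compile(
    r"^(?P<domain>[A-Za-z0-9]+)_(?P<lang>[A-Za-z]+)_v(?P<version>[0-9A-Za-z.]+)\.zip$"
)


def parse_bundle_name(filename):
    """Split a bundle file name into domain, lang and version, or return None."""
    match = BUNDLE_NAME.match(filename)
    return match.groupdict() if match else None
```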

Shared configuration

RegiStream reads per-user configuration from a config file, one per language client, by default under ~/.registream/ on mac/linux (the table below lists the exact path for each client and OS). For team deployments where each researcher has their own writable home directory but you want consistent settings, distribute a template config file.

Config file paths per language

Language	Default config path	Format
Stata	`~/.registream/config_stata.csv` (mac/linux); `~/AppData/Local/registream/config_stata.csv` (Windows)	Semicolon-separated, one row per `key;value` setting
Python	`~/.registream/config_python.toml` (mac/linux); `~/AppData/Local/registream/config_python.toml` (Windows)	TOML (flat key/value)
R	`tools::R_user_dir("registream", "config")/config_r.toml` — CRAN-compliant per OS	TOML (flat key/value)

Setting REGISTREAM_DIR places all three clients' config files in the same directory — on R that overrides the default R_user_dir location and brings R's config in line with Stata and Python.

Option A — a shared config file

Create a template config file and distribute it. Researchers place it at the appropriate path on their own machine. The RegiStream client reads it on first use and skips the first-run wizard.

Stata example (~/.registream/config_stata.csv) — semicolon-delimited key;value file. Secure-environment preset (no internet, no telemetry, datamirror floors at agency-strict defaults):

key;value
usage_logging;true
telemetry_enabled;false
internet_access;false
auto_update_check;false
dm_min_cell_size;50
dm_quantile_trim;1

Presence of the usage_logging key is Stata's first-run-done marker — there is no separate first_run_completed field on the Stata side. last_update_check, update_available, and the *_latest_version keys are runtime cache fields the client writes itself; you do not need to pre-populate them.
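
If the template is generated by a script rather than edited by hand, the semicolon-delimited layout is easy to produce with the standard csv module. A minimal sketch, assuming plain newline line endings and values that need no quoting; `render_stata_config` is a hypothetical helper name.

```python
import csv
import io

# The secure-environment preset shown above, as (key, value) pairs.
STATA_PRESET = [
    ("usage_logging", "true"),
    ("telemetry_enabled", "false"),
    ("internet_access", "false"),
    ("auto_update_check", "false"),
    ("dm_min_cell_size", "50"),
    ("dm_quantile_trim", "1"),
]


def render_stata_config(settings):
    """Return the text of a key;value config file with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(["key", "value"])
    writer.writerows(settings)
    return buffer.getvalue()
```

Writing `render_stata_config(STATA_PRESET)` to config_stata.csv reproduces the file listed above.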

Python example (~/.registream/config_python.toml):

usage_logging = true
telemetry_enabled = false
internet_access = false
auto_update_check = false
first_run_completed = true

R example (~/.registream/config_r.toml if the user chose the shared location at first run, otherwise tools::R_user_dir("registream","config")/config_r.toml):

usage_logging = true
telemetry_enabled = false
internet_access = false
auto_update_check = false
first_run_completed = true
# Optional — set only if R should share ~/.registream with Stata and Python:
# cache_dir = "~/.registream"

cache_dir is R-only. It is written automatically when a user picks "shared" in the CRAN-mandated second prompt on first run. Include the commented line above only if you want the shared-cache layout pre-baked into the template. Python and R configs do not carry dm_min_cell_size or dm_quantile_trim — datamirror currently ships only for Stata.
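
Because the Python and R configs are flat key/value TOML, a template can be rendered from a dictionary without a TOML library. The sketch below is illustrative only: `render_flat_toml` is a hypothetical helper, it handles booleans, integers and strings, and it relies on JSON string escaping being acceptable as a TOML basic string for ordinary values such as paths.

```python
import json


def render_flat_toml(settings):
    """Render a flat dict of bool/int/str values as TOML key = value lines."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):  # test bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        elif isinstance(value, str):
            rendered = json.dumps(value)  # double-quoted, with escapes
        else:
            raise TypeError(f"unsupported value for {key!r}: {value!r}")
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines) + "\n"
```

Passing the five settings from the examples above yields the Python template; adding a `cache_dir` entry yields the R variant with the shared-cache layout.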

Ship the template in your team's internal documentation, version control, or onboarding script, and refresh it when you change a policy. Pre-setting first_run_completed = true (Python, R) or including the usage_logging key (Stata) skips the first-run wizard entirely.
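
An onboarding script can install the template without clobbering settings a researcher has already changed. A minimal sketch, assuming the template sits somewhere readable and the target path is one of those listed in the table above; `install_template` is a hypothetical name.

```python
import shutil
from pathlib import Path


def install_template(template, target):
    """Copy template to target unless target exists. Return True if a copy was made."""
    template, target = Path(template), Path(target)
    if target.exists():
        return False  # never overwrite a researcher's existing settings
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True
```

Skipping existing files means a policy change has to be rolled out deliberately, for example by removing the old config first.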

The values shown above match Offline mode: no outbound network, no telemetry, no auto-update checks. Autolabel skips all network calls without prompting; metadata is read entirely from the staged bundle tree under $REGISTREAM_DIR/autolabel/. If your environment does allow outbound HTTPS to registream.org, set internet_access = true and (optionally) auto_update_check = true.

Option B — a shared network cache

For environments with a shared filesystem (HPC clusters, some secure envs), point all clients at a network path by setting the REGISTREAM_DIR environment variable. This is the universal mechanism — Stata, Python, and R all honor it and ignore any per-config override when it is set.

# /etc/profile.d/registream.sh
export REGISTREAM_DIR=/shared/registream

When one person runs autolabel update datasets, the cache is populated for everyone; subsequent users read from that shared cache. This works well if the shared path is fast and the team uses a common set of domains. On R, this env var also pre-empts the first-run cache-location prompt.

Inspecting and changing config at runtime

registream info                                    // show current config
registream config, telemetry_enabled(false)        // change one setting
registream config, internet_access(true)           // change another
registream config                                  // no options: show config

Settable keys: usage_logging, telemetry_enabled, internet_access, auto_update_check across all three clients, plus dm_min_cell_size and dm_quantile_trim on Stata only (datamirror is Stata-only at present). Python and R expose the same set via their own API surfaces.
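
A team that keeps its policy in one place can check it against the settable keys listed above before pushing it to each client. This is a hypothetical validation helper, not a RegiStream API; it covers only the runtime-settable keys, not file-only fields such as first_run_completed or cache_dir.

```python
COMMON_KEYS = {"usage_logging", "telemetry_enabled", "internet_access", "auto_update_check"}
STATA_ONLY_KEYS = {"dm_min_cell_size", "dm_quantile_trim"}


def unsupported_keys(client, settings):
    """Return the keys in settings that the given client ('stata', 'python', 'r') cannot set."""
    allowed = COMMON_KEYS | (STATA_ONLY_KEYS if client == "stata" else set())
    return sorted(set(settings) - allowed)
```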

Private metadata domains

A "domain" in autolabel is a named metadata source. Public domains (scb, dst, ssb, fk, sos, hagstofa) are served from registream.org and downloaded on first use. Private domains are defined entirely by files on your disk. No internet, no registration, no coordination with RegiStream.

* Given a private "hospital" domain on disk, this just works:
autolabel variables, domain(hospital) lang(eng)

Use cases:

A hospital has its own patient-register coding conventions; a private domain captures them.
A university survey research group maintains metadata for a panel they run.
An agency has an internal register not yet in the public catalog; a private domain serves it internally.
A research project curates harmonized metadata across a set of public sources; a private "project" domain packages that harmonization.

Building a private domain

A conformant private domain is just a directory with files in the autolabel schema v2 layout. The tool doesn't care whether the files came from our catalog pipeline or from a spreadsheet a researcher exported.

Directory structure

~/.registream/autolabel/hospital/
    manifest/
        hospital_manifest_eng.csv
    scope/
        hospital_scope_eng.csv
    variables/
        hospital_variables_eng.csv
    value_labels/
        hospital_value_labels_eng.csv
    release_sets/
        hospital_release_sets_eng.csv
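
The layout above follows a regular pattern: one subdirectory per file kind, each holding {domain}_{kind}_{lang}.csv. A short sketch that creates the empty skeleton and reports where each CSV belongs; `scaffold_domain` is a hypothetical helper and `root` stands for ~/.registream or $REGISTREAM_DIR.

```python
from pathlib import Path

KINDS = ("manifest", "scope", "variables", "value_labels", "release_sets")


def scaffold_domain(root, domain, lang):
    """Create the five subdirectories and return the CSV path expected in each."""
    base = Path(root) / "autolabel" / domain
    paths = {}
    for kind in KINDS:
        (base / kind).mkdir(parents=True, exist_ok=True)
        paths[kind] = base / kind / f"{domain}_{kind}_{lang}.csv"
    return paths
```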

Minimum-viable private domain (for single-scope data)

If your data is a single dataset with no scope hierarchy and one release, you can ship just the two core files:

variables.csv — one row per variable (omit release_set_id)
value_labels.csv — content-hashed value sets

autolabel falls back to majority-label collapse in that case (which is trivially correct when there's only one scope). See Core vs. augmentation.
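
To see which of the two layouts a given directory provides, a quick presence check is enough. The sketch below assumes the prefixed file names from the directory structure section and treats manifest, scope and release_sets as the optional files, which is a reading of this page rather than a documented rule; `classify_domain` is a hypothetical name.

```python
from pathlib import Path

CORE = ("variables", "value_labels")
OPTIONAL = ("manifest", "scope", "release_sets")


def classify_domain(domain_dir, domain, lang):
    """Return 'incomplete', 'core-only', 'mixed' or 'full' based on which CSVs exist."""
    domain_dir = Path(domain_dir)

    def present(kind):
        return (domain_dir / kind / f"{domain}_{kind}_{lang}.csv").is_file()

    if not all(present(kind) for kind in CORE):
        return "incomplete"  # a core file is missing
    found = [kind for kind in OPTIONAL if present(kind)]
    if not found:
        return "core-only"
    return "full" if len(found) == len(OPTIONAL) else "mixed"
```

A "mixed" result (core files plus only some of the others) is worth reviewing by hand, since this page does not say how autolabel treats it.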

Building the files

Most institutions have variable definitions in some existing form: a codebook Excel file, a dictionary DTA, a vendor-supplied PDF. A one-off ETL script generates the five CSVs. Typical pattern (pseudocode):

# Python sketch: the emit_* helpers are placeholders for functions you write
# yourself against the schema reference; they are not shipped with RegiStream
import pandas as pd

# read your codebook (Excel, CSV, whatever)
codebook = pd.read_excel('internal_codebook.xlsx')

# emit scope.csv (one row per register-variant-release combo you have)
emit_scope_csv(...)

# emit variables.csv (one row per variable metadata era; point at
# release_set_id for the releases where that metadata applies)
emit_variables_csv(...)

# emit value_labels.csv (content-hashed; dedup)
emit_value_labels_csv(...)

# emit release_sets.csv (junction table)
emit_release_sets_csv(...)

# emit manifest.csv (config)
emit_manifest_csv(..., scope_depth=1)

The schema reference has the full column specs. For a faster path, start from a public domain bundle (e.g. scb_eng_v20260309.zip) as a concrete example and edit from there.
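
The "content-hashed; dedup" step for value labels can be illustrated as follows. This is a sketch of the general technique only: the hash function, the id length and the canonical form are assumptions made for the example, and the variable names in the usage note are invented. Take the real id format from the schema reference.

```python
import hashlib
import json


def value_set_id(labels):
    """Hash a {code: label} mapping so identical value sets get the same id."""
    canonical = json.dumps(
        sorted((str(code), str(label)) for code, label in labels.items()),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def dedupe_value_sets(per_variable):
    """Map each variable to a value-set id and keep one copy of each distinct set."""
    variable_to_id, unique_sets = {}, {}
    for variable, labels in per_variable.items():
        set_id = value_set_id(labels)
        variable_to_id[variable] = set_id
        unique_sets.setdefault(set_id, labels)
    return variable_to_id, unique_sets
```

With this approach two variables that share the same codes and labels, say a hypothetical `sex` and `sex_spouse`, point at a single stored value set regardless of the order in which the codes were listed.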

Coming: a helper script that builds private-domain CSVs from a simpler input form (a single spreadsheet or a DTA). Tracked on the roadmap.

Distributing to your team

Once you have a working private domain directory, distributing it to team members is just file copying. Options:

Shared network path: each researcher's ~/.registream/autolabel/hospital/ is a symlink to /shared/registream/autolabel/hospital/. One update, everyone gets it.
Git repo: keep the private domain under version control. Each team member clones and symlinks. Gives you history and a review process for metadata changes.
Agency-managed deployment: IT deploys the directory tree into each researcher's home directory on provisioning.

The only expectation the tool has is that the files be at ~/.registream/autolabel/<domain>/ (or $REGISTREAM_DIR/autolabel/<domain>/ when that variable is set). Everything else is your choice.
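
For the symlink-based options, a provisioning script might create the link like this. A minimal sketch with a hypothetical helper name; it assumes a platform where the user is allowed to create symlinks and refuses to replace a real directory.

```python
from pathlib import Path


def link_domain(shared_domain_dir, local_autolabel_dir, domain):
    """Point <local_autolabel_dir>/<domain> at the shared copy and return the link path."""
    shared = Path(shared_domain_dir)
    link = Path(local_autolabel_dir) / domain
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        if link.resolve() == shared.resolve():
            return link  # already points at the shared copy
        link.unlink()  # stale link: replace it
    elif link.exists():
        raise FileExistsError(f"{link} is a real directory; move it aside first")
    link.symlink_to(shared, target_is_directory=True)
    return link
```

For example, `link_domain("/shared/registream/autolabel/hospital", Path.home() / ".registream" / "autolabel", "hospital")` sets up the layout described in the first option.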

Hybrid deployments (public + private)

You can mix public and private domains freely. A research group working on Swedish microdata with its own project-specific metadata extensions typically has:

scb — public catalog, updated periodically via autolabel update datasets
team — private project domain living in a shared network path

Each dataset gets labeled with the domain that best matches:

use scb_extract.dta, clear
autolabel variables, domain(scb) lang(eng)

use team_derived_vars.dta, clear
autolabel variables, domain(team) lang(eng)

use merged.dta, clear
autolabel variables base_vars*, domain(scb) lang(eng)
autolabel variables derived*,    domain(team) lang(eng)

autolabel's label-wipe guard (see Labeling rules) ensures chained calls don't overwrite each other — variables in one call's varlist that have no row in that domain are skipped, preserving whatever label the previous call applied.
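
The effect of the guard on chained calls can be modelled in a few lines. This is an illustration of the behaviour described above, not autolabel's implementation; the function name, variable names and labels are invented for the example.

```python
def apply_domain_labels(current, domain_labels, varlist):
    """Return a new label mapping; only variables with a row in the domain are relabeled."""
    updated = dict(current)
    for var in varlist:
        if var in domain_labels:
            updated[var] = domain_labels[var]
        # no row in this domain: the existing label is left untouched
    return updated


# Two chained calls over the same varlist, one per domain.
scb = {"base_vars_income": "Disposable income"}
team = {"derived_ratio": "Income ratio"}
varlist = ["base_vars_income", "derived_ratio"]

labels = apply_domain_labels({}, scb, varlist)
labels = apply_domain_labels(labels, team, varlist)
```

After both calls `labels` holds the scb label for `base_vars_income` and the team label for `derived_ratio`; the second call does not wipe the first.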