Institutional setup
For research groups, departments, or secure-env tenants deploying RegiStream at scale: shared configuration, private domains for your institution's own register data, and hybrid deployments.
Who this is for
Individual researchers can skip this page and stick to the install overview. This page is for situations where RegiStream is being deployed for more than one researcher, or where the team has its own data with its own variable conventions:
- A research group at a university department wants everyone on the same catalog version, same config, same conventions.
- A hospital or administrative agency wants a private metadata domain for their internal registers, so researchers can run autolabel variables, domain(hospital) on in-house data.
- A secure-env tenant (a project inside MONA / Forskermaskinen / Dapla) wants to pre-configure autolabel for all members of the project.
Sysadmin: read-only deployment for all users
The cleanest way to provision RegiStream for many users on a shared system: place the bundle tree in a read-only directory once, and point every user's autolabel client at it via the REGISTREAM_DIR environment variable.
1. Populate the shared directory once
A privileged user (admin / data steward) downloads the bundles and places them in a path readable by all target users:
# As admin, on a machine with internet
sudo mkdir -p /opt/registream/autolabel/scb
# Place the bundle tree at /opt/registream/autolabel/scb/{manifest,scope,variables,value_labels,release_sets}
sudo chmod -R a+rX /opt/registream # everyone can read; nobody can write
2. Set REGISTREAM_DIR globally
Add an env var so every user's autolabel client reads from the shared dir:
# /etc/profile.d/registream.sh (Linux/Mac, system-wide)
export REGISTREAM_DIR=/opt/registream
For Windows, set REGISTREAM_DIR as a system environment variable.
3. Verify per-user
From any user account on the system:
# Stata
registream info
# Python
python -m registream info
# R
library(autolabel); info()
Each should report the cache directory as /opt/registream. Subsequent autolabel calls read bundles from the shared tree without writing anywhere.
Note: autolabel writes a .salt file and a usage log on first use. Those writes target the user's config directory (always per-user), not the cache directory. The bundle tree itself is read-only by design once downloaded; the client never modifies it. So pointing REGISTREAM_DIR at a read-only shared path is safe.
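The per-user verification can also be scripted. The sketch below is a hypothetical helper, not part of RegiStream: it only checks that REGISTREAM_DIR is set, that the directory is readable but not writable by the current user, and that it contains an autolabel/ subdirectory. It does not call the autolabel client.

```python
import os
from pathlib import Path


def check_shared_cache(env=None):
    """Return a list of problems with the shared cache; an empty list means OK.

    Hypothetical helper: it inspects the filesystem only and never calls autolabel.
    """
    env = os.environ if env is None else env
    root = env.get("REGISTREAM_DIR")
    if not root:
        return ["REGISTREAM_DIR is not set"]

    path = Path(root)
    if not path.is_dir():
        return [f"{path} is not a directory"]

    problems = []
    # Read + traverse permission is what `chmod -R a+rX` is meant to grant.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable by this user")
    # A shared deployment is expected to be read-only for ordinary users.
    if os.access(path, os.W_OK):
        problems.append(f"{path} is writable by this user (expected read-only)")
    if not (path / "autolabel").is_dir():
        problems.append(f"{path / 'autolabel'} is missing")
    return problems


if __name__ == "__main__":
    for line in check_shared_cache() or ["shared cache looks OK"]:
        print(line)
```

Run it from an ordinary user account rather than as root, since root passes the write check on almost any path.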
Updating the shared bundle
When a new catalog version ships, the admin replaces the contents of /opt/registream/autolabel/<domain>/ with the new bundle tree. No user-side action is needed; on next call autolabel sees the new bundle and uses it.
Alternatively, drop a bundle archive named {domain}_{lang}_v{version}.zip directly into $REGISTREAM_DIR/autolabel/; the first call unzips and processes it in place, with no separate ingest step required. See Secure environments → Three ways to stage the bundle for the option list.
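One way to replace a domain directory without leaving users looking at a half-copied tree is to stage the new tree first and then swap it in with two renames. This is a sketch of that general technique, not a RegiStream feature. The function name and the separate staging directory are assumptions, and the staging directory must sit on the same filesystem as the live tree for the renames to work.

```python
import os
import shutil
from pathlib import Path


def swap_in_bundle(new_tree, live_dir, staging_dir):
    """Replace live_dir with a copy of new_tree; return the backup path.

    Hypothetical admin helper. The old tree, if there was one, ends up at the
    returned backup path so the update can be rolled back by hand.
    """
    new_tree, live_dir, staging_dir = map(Path, (new_tree, live_dir, staging_dir))
    staged = staging_dir / (live_dir.name + ".new")
    backup = staging_dir / (live_dir.name + ".old")

    staging_dir.mkdir(parents=True, exist_ok=True)
    live_dir.parent.mkdir(parents=True, exist_ok=True)

    # Clear leftovers from a previous run before staging again.
    for leftover in (staged, backup):
        if leftover.exists():
            shutil.rmtree(leftover)

    # Slow part: the copy happens outside the live tree.
    shutil.copytree(new_tree, staged)

    # Fast part: two renames, so the live path is missing only for an instant.
    if live_dir.exists():
        os.rename(live_dir, backup)
    os.rename(staged, live_dir)
    return backup
```

For example, swap_in_bundle("/tmp/scb_new", "/opt/registream/autolabel/scb", "/opt/registream-staging") would leave the previous tree under /opt/registream-staging/scb.old; both the source and staging paths here are placeholders.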
Private metadata domains
A "domain" in autolabel is a named metadata source. Public domains (scb, dst, ssb, fk, sos, hagstofa) are served from registream.org and downloaded on first use. Private domains are defined entirely by files on your disk. No internet, no registration, no coordination with RegiStream.
* Given a private "hospital" domain on disk, this just works:
autolabel variables, domain(hospital) lang(eng)
Use cases:
- A hospital has its own patient-register coding conventions; a private domain captures them.
- A university survey research group maintains metadata for a panel they run.
- An agency has an internal register not yet in the public catalog; a private domain serves it internally.
- A research project curates harmonized metadata across a set of public sources; a private "project" domain packages that harmonization.
Building a private domain
A conformant private domain is just a directory with files in the autolabel schema v2 layout. The tool doesn't care whether the files came from our catalog pipeline or from a spreadsheet a researcher exported.
Directory structure
~/.registream/autolabel/hospital/
  manifest/
    hospital_manifest_eng.csv
  scope/
    hospital_scope_eng.csv
  variables/
    hospital_variables_eng.csv
  value_labels/
    hospital_value_labels_eng.csv
  release_sets/
    hospital_release_sets_eng.csv
Minimum-viable private domain (for single-scope data)
If your data is a single dataset with no scope hierarchy and one release, you can ship just the two core files:
- variables.csv: one row per variable (omit release_set_id)
- value_labels.csv: content-hashed value sets
autolabel falls back to majority-label collapse in that case (which is trivially correct when there's only one scope). See Core vs. augmentation.
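Before handing a domain to users, it can help to check which of the files are actually present. The sketch below is a hypothetical checker, not part of the tool. It assumes the {domain}_{kind}_{lang}.csv naming and per-kind subdirectories shown under Directory structure, and it assumes a minimum-viable domain uses the same names for its two core files.

```python
from pathlib import Path

# The two core files, and the three files a single-scope domain may omit.
CORE = ("variables", "value_labels")
AUGMENTATION = ("manifest", "scope", "release_sets")


def bundle_status(root, domain, lang):
    """Classify <root>/autolabel/<domain> as 'full', 'core-only', or 'incomplete'."""
    base = Path(root) / "autolabel" / domain

    def present(kind):
        # e.g. hospital/variables/hospital_variables_eng.csv
        return (base / kind / f"{domain}_{kind}_{lang}.csv").is_file()

    if not all(present(kind) for kind in CORE):
        return "incomplete"
    if all(present(kind) for kind in AUGMENTATION):
        return "full"
    return "core-only"
```

Calling bundle_status(Path.home() / ".registream", "hospital", "eng") on a two-file domain would return "core-only". This only tests for file presence; it says nothing about whether the columns match the schema.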
Building the files
Most institutions have variable definitions in some existing form — a codebook Excel file, a dictionary DTA, a vendor-supplied PDF. A one-off ETL script generates the 5 CSVs. Typical pattern (pseudocode):
# Python sketch; the emit_* helpers are placeholders you write against the schema reference
import pandas as pd
# read your codebook (Excel, CSV, whatever)
codebook = pd.read_excel('internal_codebook.xlsx')
# emit scope.csv (one row per register-variant-release combo you have)
emit_scope_csv(...)
# emit variables.csv (one row per variable metadata era; point at
# release_set_id for the releases where that metadata applies)
emit_variables_csv(...)
# emit value_labels.csv (content-hashed; dedup)
emit_value_labels_csv(...)
# emit release_sets.csv (junction table)
emit_release_sets_csv(...)
# emit manifest.csv (config)
emit_manifest_csv(scope_depth=1, ...)
The schema reference has the full column specs. For a faster path, start from a public domain bundle (e.g. scb_eng_v20260309.zip) as a concrete example and edit from there.
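To make "content-hashed; dedup" concrete, the sketch below derives an ID from the contents of each value set, so that variables sharing the same codes and labels point at a single stored set. The hashing scheme, the ID length, and the example variables are all assumptions for illustration; the actual columns and hash format are defined in the schema reference.

```python
import hashlib
import json


def value_set_id(labels):
    """Hash a {code: label} mapping into a short, order-independent ID.

    Illustrative scheme only; codes are compared as strings, so 1 and "1" match.
    """
    canonical = json.dumps(
        sorted((str(code), str(label)) for code, label in labels.items()),
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def dedup_value_sets(variables):
    """Map each variable to a value-set ID and keep one copy of each distinct set."""
    var_to_id, sets_by_id = {}, {}
    for name, labels in variables.items():
        vid = value_set_id(labels)
        var_to_id[name] = vid
        sets_by_id.setdefault(vid, dict(labels))
    return var_to_id, sets_by_id


# Hypothetical codebook: two variables share one value set.
codebook = {
    "sex": {1: "Male", 2: "Female"},
    "partner_sex": {2: "Female", 1: "Male"},
    "ward": {10: "Cardiology", 20: "Oncology"},
}
var_to_id, sets_by_id = dedup_value_sets(codebook)
print(len(sets_by_id), "distinct value sets for", len(var_to_id), "variables")
```

Sorting the pairs before hashing is what makes the ID independent of the order in which the codebook lists its codes.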
Distributing to your team
Once you have a working private domain directory, distributing it to team members is just file copying. Options:
- Shared network path: each researcher's ~/.registream/autolabel/hospital/ is a symlink to /shared/registream/autolabel/hospital/. One update, everyone gets it.
- Git repo: keep the private domain under version control. Each team member clones and symlinks. Gives you history and a review process for metadata changes.
- Agency-managed deployment: IT deploys the directory tree into each researcher's home directory on provisioning.
The only expectation the tool has is that the files be at ~/.registream/autolabel/<domain>/. Everything else is your choice.
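For the symlink-based options, the per-user step can be a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes a POSIX-style system where ordinary users may create symlinks, and it refuses to touch an existing real directory.

```python
import os
from pathlib import Path


def link_domain(shared_domain_dir, home=None):
    """Symlink ~/.registream/autolabel/<domain> to a shared domain directory.

    Hypothetical helper; the domain name is taken from the shared directory's name.
    """
    shared = Path(shared_domain_dir).resolve()
    home = Path(home) if home is not None else Path.home()
    link = home / ".registream" / "autolabel" / shared.name

    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # re-point an existing link
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink; move it first")

    os.symlink(shared, link, target_is_directory=True)
    return link
```

Calling link_domain("/shared/registream/autolabel/hospital") would create ~/.registream/autolabel/hospital pointing at the shared tree. The same function works for a Git checkout by passing the path of the cloned domain directory.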
Hybrid deployments (public + private)
You can mix public and private domains freely. A research group working on Swedish microdata with its own project-specific metadata extensions typically has:
- scb: public catalog, updated periodically via autolabel update datasets
- team: private project domain living in a shared network path
Each dataset gets labeled with the domain that best matches:
use scb_extract.dta, clear
autolabel variables, domain(scb) lang(eng)
use team_derived_vars.dta, clear
autolabel variables, domain(team) lang(eng)
use merged.dta, clear
autolabel variables base_vars*, domain(scb) lang(eng)
autolabel variables derived*, domain(team) lang(eng)
autolabel's label-wipe guard (see Labeling rules) ensures chained calls don't overwrite each other: variables in one call's varlist that have no row in that domain are skipped, preserving whatever label the previous call applied.
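The effect of the guard on chained calls can be pictured with a toy model. The sketch below is not autolabel's implementation; it models a domain as a plain dictionary from variable name to label, and the variable names and labels are made up.

```python
def apply_labels(labels, varlist, domain_rows):
    """Label each variable that has a row in the domain; leave the rest untouched."""
    updated = dict(labels)
    for var in varlist:
        if var in domain_rows:
            updated[var] = domain_rows[var]
        # No row in this domain: keep whatever label is already there.
    return updated


# Hypothetical domains, each knowing only its own variables.
scb = {"base_income": "Disposable income"}
team = {"derived_ratio": "Income-to-need ratio"}

# Both calls see the full varlist, as in two chained autolabel calls.
varlist = ["base_income", "derived_ratio"]
labels = apply_labels({}, varlist, scb)
labels = apply_labels(labels, varlist, team)
print(labels)
```

The second call does not clear base_income even though the team domain has no row for it, which is the behavior the guard is described as providing.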
See also
- Install overview
- Secure environments
- Schema v2 reference — for building private domains
- Institutional metadata section of the Stata reference