RegiStream

Overview

RegiStream creates and publishes professionally curated metadata for administrative register datasets. We scrape raw metadata from statistical agencies and other data sources, then harmonize it and run integrity checks and quality control before publication.

Our Process

Data Collection - Systematic scraping and extraction from official sources
Harmonization - Standardizing variable names, labels, and value encodings
Quality Control - Automated integrity checks and manual verification
Multilingual Support - Translation and localization where available
Version Control - Dated releases for reproducibility
Continuous Improvement - Ongoing updates and coverage expansion

Published Dataset Features

Schema-compliant - All datasets follow Schema 1.0 specification
Server-hosted - Downloadable on-demand via API
Version-controlled - Multiple dated versions available for reproducibility
Professionally curated - Extensive quality control and harmonization

Current Coverage: Statistics Sweden (SCB) registers are currently available. We are actively working to expand coverage to additional statistical agencies and data sources.

Schema Versions

RegiStream datasets follow a schema versioning system so that the format can improve over time while remaining backward compatible.

Overview

Schema 1.0 is the current stable schema version for RegiStream metadata files. All datasets use semicolon-delimited CSV format with UTF-8 encoding.

Schema Specification

Variables File Structure

Variables files describe dataset variables with labels, definitions, units, and types:

Column	Type	Required	Description	Example
`variable_name`	string	Yes	Variable identifier (lowercase)	`kon`, `inkomst`
`variable_label`	string	Yes	Short descriptive label	`Sex`, `Income`
`variable_definition`	string	Yes	Detailed explanation of the variable	`Gender of the person. Binary classification.`
`variable_unit`	string	No	Unit of measurement (if applicable)	`SEK`, `kg`, `%`
`variable_type`	string	Yes	Canonical variable type - one of: `categorical`, `continuous`, `text`, `date`, `binary`	`categorical`, `continuous`
`value_label_id`	integer	No	Link to value labels (if applicable)	`1`, `42`
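
As a minimal sketch of how a variables file in this layout could be read with Python's standard `csv` module: the sample text, the `read_variables` helper, and the "Annual income." definition are hypothetical and only illustrate the column layout described above. This is not the RegiStream Python package's API.

```python
import csv
import io

# Hypothetical sample in the Schema 1.0 variables layout (semicolon-delimited).
SAMPLE = (
    "variable_name;variable_label;variable_definition;"
    "variable_unit;variable_type;value_label_id\n"
    "kon;Sex;Gender of the person. Binary classification.;;categorical;1\n"
    "inkomst;Income;Annual income.;SEK;continuous;\n"
)

# Columns marked "Required: Yes" in the table above.
REQUIRED = ("variable_name", "variable_label", "variable_definition", "variable_type")


def read_variables(text):
    """Parse variables-file text into a list of dicts, checking required columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"line {line_no}: missing required columns {missing}")
        # value_label_id is optional; convert to int only when present.
        raw_id = (row.get("value_label_id") or "").strip()
        row["value_label_id"] = int(raw_id) if raw_id else None
        rows.append(row)
    return rows


variables = read_variables(SAMPLE)
```

For a file on disk, the same reader can be fed from `open(path, encoding="utf-8", newline="")` instead of `io.StringIO`.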

Allowed Values for `variable_type`

Value	Description	Has Value Labels?	Examples
`categorical`	Discrete categories	Yes	Sex, Legal form, Country code
`continuous`	Numeric measurements	No	Income, Age, Temperature
`text`	Free-form strings	No	Names, Addresses, Comments
`date`	Temporal data	No	Birth date, Registration date
`binary`	Boolean/indicators (0/1)	Yes	Is active, Has children
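
The table above can be turned into a small row-level check. The sketch below is an illustration only: `check_variable_type` is a hypothetical helper, and treating a `value_label_id` on a `continuous`, `text`, or `date` variable as a problem is an interpretation of the "Has Value Labels?" column, not a documented rule.

```python
# variable_type -> whether the type is expected to have value labels (from the table).
ALLOWED_TYPES = {
    "categorical": True,
    "continuous": False,
    "text": False,
    "date": False,
    "binary": True,
}


def check_variable_type(variable_type, value_label_id=None):
    """Return a list of problems for one row (an empty list means none were found)."""
    problems = []
    if variable_type not in ALLOWED_TYPES:
        problems.append(f"unknown variable_type: {variable_type!r}")
    elif value_label_id is not None and not ALLOWED_TYPES[variable_type]:
        # Assumption: types listed with "No" should not link to a label set.
        problems.append(
            f"{variable_type} variables are not expected to have value labels"
        )
    return problems
```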

Value Labels File Structure

Value labels files provide category labels for categorical and binary variables in two formats, one as JSON and one for Stata:

Column	Type	Required	Description
`value_label_id`	integer	Yes	Unique identifier linking to variables
`variable_name`	string	Yes	Variable this label set applies to
`value_labels_json`	string	Yes	JSON format for Python, R, APIs (e.g., `{"K": "Woman", "M": "Man"}`)
`value_labels_stata`	string	Yes	Space-separated quoted pairs for Stata parsing (e.g., `"K" "Woman" "M" "Man"`)
`conflict`	integer	No	Harmonization conflict flag (0/1)
`harmonized_automatically`	integer	No	Auto-harmonization flag (0/1)
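
To illustrate the two encodings, the sketch below parses both example cells from the table into the same Python dictionary. The helper names are hypothetical, and the Stata parser assumes that codes and labels contain no embedded double quotes; real files may need a more careful parser.

```python
import json
import re

# Matches one double-quoted token; assumes no embedded double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def parse_json_labels(cell):
    """Parse a value_labels_json cell into a dict of code -> label."""
    return json.loads(cell)


def parse_stata_labels(cell):
    """Parse a value_labels_stata cell ("code" "label" pairs) into a dict."""
    tokens = QUOTED.findall(cell)
    if len(tokens) % 2:
        raise ValueError("expected an even number of quoted tokens")
    # Even positions are codes, odd positions are their labels.
    return dict(zip(tokens[0::2], tokens[1::2]))


json_cell = '{"K": "Woman", "M": "Man"}'
stata_cell = '"K" "Woman" "M" "Man"'
labels_from_json = parse_json_labels(json_cell)
labels_from_stata = parse_stata_labels(stata_cell)
```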

File Format: Schema 1.0 datasets use semicolon-delimited CSV files with UTF-8 encoding.

Chunked Files: Large datasets are split into small files (≤5 MB each) to facilitate manual transfer into secure computing environments. Files are numbered with zero-padded suffixes (e.g., _0000.csv, _0001.csv). The RegiStream autolabel module merges these chunks automatically when loading. If you use the files manually, append them in numerical order.
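
For manual use, appending the chunks in numerical order might look like the sketch below. The `merge_chunks` helper and the file stem are hypothetical. Whether each chunk repeats the header row is not stated here, so the code handles both cases by dropping a first line that is identical to the header of the first chunk.

```python
import re
from pathlib import Path

# Suffix pattern for chunk files such as "<stem>_0000.csv".
CHUNK_SUFFIX = re.compile(r"_(\d+)\.csv")


def merge_chunks(folder, stem, out_path):
    """Append <stem>_NNNN.csv files into out_path in numerical order.

    Returns the chunk file names in the order they were appended.
    """
    chunks = []
    for path in Path(folder).glob(f"{stem}_*.csv"):
        match = CHUNK_SUFFIX.fullmatch(path.name[len(stem):])
        if match:
            chunks.append((int(match.group(1)), path))
    chunks.sort()
    if not chunks:
        raise FileNotFoundError(f"no chunks found for {stem!r} in {folder}")

    header = None
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for _, path in chunks:
            with open(path, encoding="utf-8", newline="") as src:
                first = src.readline()
                if not first:
                    continue  # empty chunk, nothing to append
                if header is None:
                    header = first.rstrip("\r\n")
                    lines = [first]
                elif first.rstrip("\r\n") == header:
                    lines = []  # repeated header row, keep only the first copy
                else:
                    lines = [first]  # chunk has no header, first line is data
                lines.extend(src)
            for line in lines:
                # Make sure a chunk boundary never fuses two rows together.
                out.write(line if line.endswith("\n") else line + "\n")
    return [path.name for _, path in chunks]
```

Sorting on the parsed integer rather than the file name keeps the order correct even if the zero padding is inconsistent.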

Available Datasets

RegiStream publishes curated metadata datasets organized by data source. Each dataset undergoes rigorous quality control, harmonization, and integrity checking before publication.

Expansion Roadmap: We are continuously expanding our coverage to include more statistical agencies and data sources. Each new dataset requires significant development work including scraping infrastructure, harmonization logic, and quality validation.

Variables

Variable files contain metadata describing dataset variables including variable codes, labels, detailed definitions, data types, and units of measurement.

Value Labels

Value label files provide human-readable descriptions for coded categorical values (variables with variable_type = categorical or binary).
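
A sketch of how the two file types fit together: rows from a variables file are linked to label sets through `value_label_id`, and the label sets are then used to decode coded values. The sample rows, the income figure, and the helper names are hypothetical, and the code assumes that `value_label_id` alone identifies a label set.

```python
import json

# Hypothetical rows, already parsed from a variables file and a value labels file.
variables = [
    {"variable_name": "kon", "variable_type": "categorical", "value_label_id": 1},
    {"variable_name": "inkomst", "variable_type": "continuous", "value_label_id": None},
]
value_labels = [
    {
        "value_label_id": 1,
        "variable_name": "kon",
        "value_labels_json": '{"K": "Woman", "M": "Man"}',
    },
]


def build_decoders(variables, value_labels):
    """Map each labelled variable name to its code -> label dictionary."""
    by_id = {
        row["value_label_id"]: json.loads(row["value_labels_json"])
        for row in value_labels
    }
    decoders = {}
    for var in variables:
        labelled = var["variable_type"] in ("categorical", "binary")
        if labelled and var["value_label_id"] in by_id:
            decoders[var["variable_name"]] = by_id[var["value_label_id"]]
    return decoders


def decode(record, decoders):
    """Replace coded values with labels; values without a label are kept as-is."""
    # JSON object keys are always strings, so codes are compared as strings.
    return {
        name: decoders.get(name, {}).get(str(value), value)
        for name, value in record.items()
    }


decoders = build_decoders(variables, value_labels)
decoded = decode({"kon": "K", "inkomst": 250000}, decoders)
```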

Languages

Datasets are available in multiple languages depending on the data source. RegiStream provides metadata in both the original language and English translations where applicable.

Language availability varies by dataset domain. Use the download modal to see which languages are available for each dataset.

Language Codes: We use ISO 639-2 three-letter codes (e.g., eng for English, swe for Swedish).

Downloads

Download RegiStream's published datasets, organized by data source. All datasets are professionally curated, version-controlled, and thoroughly quality-checked.

Currently Published Datasets

Domain	Description	Languages
Statistics Sweden (SCB)	Curated and harmonized metadata from Swedish administrative registers. Includes standardized variable descriptions, value labels, and multilingual support.	English, Swedish

Note: The "Latest" version is automatically updated when new releases are available. For reproducible research, we recommend using a specific dated version.

Getting Started

RegiStream datasets can be accessed through platform-specific packages that handle downloading, caching, and loading metadata automatically.

Choose Your Platform

Stata

Install the RegiStream package for Stata to automatically download and apply variable labels and value labels to your datasets.

Stata Documentation

Python

Use the RegiStream Python package to access dataset metadata programmatically in your data analysis workflows.

Python Documentation
