Overview

RegiStream creates and publishes professionally curated metadata for administrative register datasets. We collect raw metadata from statistical agencies and other data sources by scraping, then harmonize it and run integrity checks and quality control before publication.

Our Process

  • Data Collection - Systematic scraping and extraction from official sources
  • Harmonization - Standardizing variable names, labels, and value encodings
  • Quality Control - Automated integrity checks and manual verification
  • Multilingual Support - Translation and localization where available
  • Version Control - Dated releases for reproducibility
  • Continuous Improvement - Ongoing updates and coverage expansion

Published Dataset Features

  • Schema-compliant - All datasets follow the Schema 1.0 specification
  • Server-hosted - Downloadable on demand via API
  • Version-controlled - Multiple dated versions available for reproducibility
  • Professionally curated - Extensive quality control and harmonization

Current Coverage: Metadata for Statistics Sweden (SCB) registers is available now. We are actively working to expand coverage to additional statistical agencies and data sources.

Schema Versions

RegiStream datasets follow a schema versioning system, so the schema can be improved over time while maintaining backward compatibility.

Overview

Schema 1.0 is the current stable schema version for RegiStream metadata files. All datasets use semicolon-delimited CSV format with UTF-8 encoding.

Schema Specification

Variables File Structure

Variables files describe dataset variables with labels, definitions, units, and types:

| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| variable_name | string | Yes | Variable identifier (lowercase) | kon, inkomst |
| variable_label | string | Yes | Short descriptive label | Sex, Income |
| variable_definition | string | Yes | Detailed explanation of the variable | Gender of the person. Binary classification. |
| variable_unit | string | No | Unit of measurement (if applicable) | SEK, kg, % |
| variable_type | string | Yes | Canonical variable type; one of: categorical, continuous, text, date, binary | categorical, continuous |
| value_label_id | integer | No | Link to value labels (if applicable) | 1, 42 |

Allowed Values for variable_type

| Value | Description | Has Value Labels? | Examples |
|---|---|---|---|
| categorical | Discrete categories | Yes | Sex, Legal form, Country code |
| continuous | Numeric measurements | No | Income, Age, Temperature |
| text | Free-form strings | No | Names, Addresses, Comments |
| date | Temporal data | No | Birth date, Registration date |
| binary | Boolean/indicators (0/1) | Yes | Is active, Has children |
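
As a minimal sketch, the structure above can be checked with the Python standard library alone. The helper name `load_variables` and the sample rows are illustrative assumptions and are not part of any RegiStream package; only the column names, the required flags, and the allowed `variable_type` values come from the specification above.

```python
import csv
import io

# Allowed values for variable_type, as listed in the Schema 1.0 table above.
ALLOWED_TYPES = {"categorical", "continuous", "text", "date", "binary"}
# Columns marked as required in the variables file structure.
REQUIRED_COLUMNS = ["variable_name", "variable_label", "variable_definition", "variable_type"]


def load_variables(handle):
    """Read a semicolon-delimited variables file and validate each row.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(handle, delimiter=";")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if not (row.get(col) or "").strip():
                raise ValueError(f"line {line_no}: missing required column {col!r}")
        if row["variable_type"] not in ALLOWED_TYPES:
            raise ValueError(f"line {line_no}: unknown variable_type {row['variable_type']!r}")
        rows.append(row)
    return rows


# Made-up sample content in the documented layout.
sample = io.StringIO(
    "variable_name;variable_label;variable_definition;variable_unit;variable_type;value_label_id\n"
    "kon;Sex;Gender of the person.;;categorical;1\n"
    "inkomst;Income;Annual income.;SEK;continuous;\n"
)
variables = load_variables(sample)
print([v["variable_name"] for v in variables])
```

When reading a real file from disk, open it with `open(path, encoding="utf-8", newline="")` so that the UTF-8 encoding and any quoted line breaks are handled as the `csv` module expects.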

Value Labels File Structure

Value labels files provide category labels for categorical and binary variables in two formats, JSON and Stata:

| Column | Type | Required | Description |
|---|---|---|---|
| value_label_id | integer | Yes | Unique identifier linking to variables |
| variable_name | string | Yes | Variable this label set applies to |
| value_labels_json | string | Yes | JSON format for Python, R, APIs (e.g., {"K": "Woman", "M": "Man"}) |
| value_labels_stata | string | Yes | Space-separated quoted pairs for Stata parsing (e.g., "K" "Woman" "M" "Man") |
| conflict | integer | No | Harmonization conflict flag (0/1) |
| harmonized_automatically | integer | No | Auto-harmonization flag (0/1) |
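
The two label formats carry the same information, which a short sketch can make concrete. The function names below are hypothetical, and the Stata parser assumes plain double-quoted tokens as in the documented example; labels that themselves contain double quotes would need more careful handling.

```python
import json
import shlex


def parse_json_labels(text):
    """Parse a value_labels_json cell into a dict of code -> label."""
    return json.loads(text)


def parse_stata_labels(text):
    """Parse a value_labels_stata cell ("code" "label" pairs) into a dict.

    Assumption: tokens are plain double-quoted strings separated by spaces.
    """
    tokens = shlex.split(text)
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of quoted tokens")
    # Even positions are codes, odd positions are the matching labels.
    return dict(zip(tokens[0::2], tokens[1::2]))


# The example values given in the table above.
json_form = '{"K": "Woman", "M": "Man"}'
stata_form = '"K" "Woman" "M" "Man"'

labels = parse_json_labels(json_form)
stata_labels = parse_stata_labels(stata_form)
print(labels == stata_labels)
```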

File Format: Schema 1.0 datasets use semicolon-delimited CSV files with UTF-8 encoding.

Chunked Files: Large datasets are split into small files (≤5 MB each) to facilitate manual transfer into secure computing environments. Files are numbered with zero-padded suffixes (e.g., _0000.csv, _0001.csv). The RegiStream autolabel module automatically merges these chunks when loading. If using the files manually, you will need to append them together in numerical order.
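
For manual use, appending the chunks in numerical order could look like the sketch below. The file names and the `merge_chunks` helper are hypothetical, and it is not stated above whether each chunk repeats the header row, so the sketch keeps the first header and drops a later chunk's first row only when it is identical to that header.

```python
import csv
import re
import tempfile
from pathlib import Path

# Matches the zero-padded suffix described above, e.g. "_0000.csv".
CHUNK_SUFFIX = re.compile(r"_(\d+)\.csv$")


def merge_chunks(chunk_paths, output_path):
    """Append chunk files in numerical order into one semicolon-delimited CSV."""

    def chunk_number(path):
        match = CHUNK_SUFFIX.search(Path(path).name)
        if match is None:
            raise ValueError(f"no numeric chunk suffix in {path}")
        return int(match.group(1))

    header = None
    with open(output_path, "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter=";", lineterminator="\n")
        for path in sorted(chunk_paths, key=chunk_number):
            with open(path, encoding="utf-8", newline="") as src:
                for i, row in enumerate(csv.reader(src, delimiter=";")):
                    if i == 0:
                        if header is None:
                            header = row  # first chunk: remember and keep the header
                        elif row == header:
                            continue  # later chunk: skip a repeated header
                    writer.writerow(row)


# Demonstration with two made-up chunks in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "example_variables_0000.csv").write_text(
        "variable_name;variable_label\nkon;Sex\n", encoding="utf-8"
    )
    (tmp / "example_variables_0001.csv").write_text(
        "variable_name;variable_label\ninkomst;Income\n", encoding="utf-8"
    )
    merged = tmp / "merged.csv"
    merge_chunks(tmp.glob("example_variables_*.csv"), merged)
    merged_text = merged.read_text(encoding="utf-8")

print(merged_text)
```

Sorting on the parsed integer rather than on the file name keeps the order correct even if the padding width ever differs between files.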

Available Datasets

RegiStream publishes curated metadata datasets organized by data source. Each dataset undergoes rigorous quality control, harmonization, and integrity checking before publication.

Expansion Roadmap: We are continuously expanding our coverage to include more statistical agencies and data sources. Each new dataset requires significant development work including scraping infrastructure, harmonization logic, and quality validation.

Variables

Variable files contain metadata describing dataset variables including variable codes, labels, detailed definitions, data types, and units of measurement.

Value Labels

Value label files provide human-readable descriptions for coded categorical values (variables with variable_type = categorical or binary).
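
To illustrate how the two file types fit together, the sketch below links a variable to its label set through `value_label_id` and decodes raw codes. The rows are made-up samples in the documented column layout, and `decode` is a hypothetical helper, not a RegiStream function.

```python
import json

# Made-up rows, as they might look after reading the two CSV files.
variables = [
    {"variable_name": "kon", "variable_type": "categorical", "value_label_id": "1"},
    {"variable_name": "inkomst", "variable_type": "continuous", "value_label_id": ""},
]
value_labels = [
    {
        "value_label_id": "1",
        "variable_name": "kon",
        "value_labels_json": '{"K": "Woman", "M": "Man"}',
    },
]

# Index each label set by its identifier for direct lookup.
labels_by_id = {
    row["value_label_id"]: json.loads(row["value_labels_json"]) for row in value_labels
}


def decode(variable_name, code):
    """Return the label for a code, or the code unchanged if no label applies."""
    for var in variables:
        if var["variable_name"] == variable_name:
            mapping = labels_by_id.get(var["value_label_id"], {})
            return mapping.get(str(code), code)
    raise KeyError(variable_name)


print(decode("kon", "K"))
print(decode("inkomst", 250000))
```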

Languages

Datasets are available in multiple languages depending on the data source. RegiStream provides metadata in both the original language and English translations where applicable.

Language availability varies by dataset domain. Use the download modal to see which languages are available for each dataset.

Language Codes: We use ISO 639-2 three-letter codes (e.g., eng for English, swe for Swedish).

Downloads

Download RegiStream's published datasets, organized by data source. All datasets are professionally curated, version-controlled, and thoroughly quality-checked.

Currently Published Datasets

| Domain | Description | Languages |
|---|---|---|
| Statistics Sweden (SCB) | Curated and harmonized metadata from Swedish administrative registers. Includes standardized variable descriptions, value labels, and multilingual support. | English, Swedish |

Note: The "Latest" version is automatically updated when new releases are available. For reproducible research, we recommend using a specific dated version.

Getting Started

RegiStream datasets can be accessed through platform-specific packages that handle downloading, caching, and loading metadata automatically.

Choose Your Platform

Stata

Install the RegiStream package for Stata to automatically download and apply variable labels and value labels to your datasets.

Stata Documentation

Python

Use the RegiStream Python package to access dataset metadata programmatically in your data analysis workflows.

Python Documentation