Overview
RegiStream creates and publishes professionally curated metadata for administrative register datasets. We collect raw metadata from statistical agencies and other data sources through systematic scraping, then harmonize it and run it through integrity checks and quality control.
Our Process
- Data Collection - Systematic scraping and extraction from official sources
- Harmonization - Standardizing variable names, labels, and value encodings
- Quality Control - Automated integrity checks and manual verification
- Multilingual Support - Translation and localization where available
- Version Control - Dated releases for reproducibility
- Continuous Improvement - Ongoing updates and coverage expansion
Published Dataset Features
- Schema-compliant - All datasets follow Schema 1.0 specification
- Server-hosted - Downloadable on-demand via API
- Version-controlled - Multiple dated versions available for reproducibility
- Professionally curated - Extensive quality control and harmonization
Current Coverage: Statistics Sweden (SCB) registers are currently available. We are actively working to expand coverage to additional statistical agencies and data sources.
Schema Versions
RegiStream datasets follow a versioned schema, so the format can improve over time while remaining backward compatible.
Overview
Schema 1.0 is the current stable schema version for RegiStream metadata files. All datasets use semicolon-delimited CSV format with UTF-8 encoding.
Schema Specification
Variables File Structure
Variables files describe dataset variables with labels, definitions, units, and types:
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| variable_name | string | Yes | Variable identifier (lowercase) | kon, inkomst |
| variable_label | string | Yes | Short descriptive label | Sex, Income |
| variable_definition | string | Yes | Detailed explanation of the variable | Gender of the person. Binary classification. |
| variable_unit | string | No | Unit of measurement (if applicable) | SEK, kg, % |
| variable_type | string | Yes | Canonical variable type, one of: categorical, continuous, text, date, binary | categorical, continuous |
| value_label_id | integer | No | Link to value labels (if applicable) | 1, 42 |
Allowed Values for variable_type
| Value | Description | Has Value Labels? | Examples |
|---|---|---|---|
| categorical | Discrete categories | Yes | Sex, Legal form, Country code |
| continuous | Numeric measurements | No | Income, Age, Temperature |
| text | Free-form strings | No | Names, Addresses, Comments |
| date | Temporal data | No | Birth date, Registration date |
| binary | Boolean/indicators (0/1) | Yes | Is active, Has children |
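As a rough illustration of the rules in the two tables above, the following sketch reads a variables file and checks the required columns and the allowed variable_type values. It uses only the Python standard library and is not part of any RegiStream package. The function names (read_variables, validate_variables) and the sample rows, including the deliberately invalid type "numeric", are made up for this example.

```python
import csv
import io

# Allowed values and required columns, taken from the Schema 1.0 tables.
ALLOWED_TYPES = {"categorical", "continuous", "text", "date", "binary"}
REQUIRED_COLUMNS = (
    "variable_name",
    "variable_label",
    "variable_definition",
    "variable_type",
)


def read_variables(handle):
    """Parse a semicolon-delimited variables file into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=";"))


def validate_variables(rows):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append(f"row {number}: missing required value for {column}")
        vtype = (row.get("variable_type") or "").strip()
        if vtype and vtype not in ALLOWED_TYPES:
            problems.append(
                f"row {number}: variable_type {vtype!r} is not an allowed value"
            )
    return problems


# Illustrative sample data (hypothetical); the second row uses an invalid type.
sample = io.StringIO(
    "variable_name;variable_label;variable_definition;"
    "variable_unit;variable_type;value_label_id\n"
    "kon;Sex;Gender of the person. Binary classification.;;categorical;1\n"
    "inkomst;Income;Illustrative definition.;SEK;numeric;\n"
)

for problem in validate_variables(read_variables(sample)):
    print(problem)
```

For a file on disk, opening it with `open(path, encoding="utf-8", newline="")` and passing the handle to read_variables should match the UTF-8, semicolon-delimited format described above.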
Value Labels File Structure
Value labels files provide category labels for categorical and binary variables in two formats, JSON and Stata:
| Column | Type | Required | Description |
|---|---|---|---|
| value_label_id | integer | Yes | Unique identifier linking to variables |
| variable_name | string | Yes | Variable this label set applies to |
| value_labels_json | string | Yes | JSON format for Python, R, APIs (e.g., {"K": "Woman", "M": "Man"}) |
| value_labels_stata | string | Yes | Space-separated quoted pairs for Stata parsing (e.g., "K" "Woman" "M" "Man") |
| conflict | integer | No | Harmonization conflict flag (0/1) |
| harmonized_automatically | integer | No | Auto-harmonization flag (0/1) |
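The two label formats carry the same information, which the following sketch illustrates by parsing each into a Python dict. It is a minimal example using the standard library, not RegiStream's own parser. The function names are hypothetical, and the Stata parser assumes labels are wrapped in plain double quotes as in the example above (it does not handle Stata compound quotes).

```python
import json
import shlex


def parse_json_labels(text):
    """Parse a value_labels_json cell into a {code: label} dict."""
    return json.loads(text)


def parse_stata_labels(text):
    """Parse a value_labels_stata cell of quoted pairs into a {code: label} dict.

    Assumption: every code and label is a double-quoted token separated by
    whitespace, as in the documented example.
    """
    tokens = shlex.split(text)
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of quoted tokens (code/label pairs)")
    # Even positions are codes, odd positions are their labels.
    return dict(zip(tokens[0::2], tokens[1::2]))


json_cell = '{"K": "Woman", "M": "Man"}'
stata_cell = '"K" "Woman" "M" "Man"'

print(parse_json_labels(json_cell))
print(parse_stata_labels(stata_cell))
```

Comparing the two parsed dicts for each row is one simple way to check that a file's JSON and Stata columns agree.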
File Format: Schema 1.0 datasets use semicolon-delimited CSV files with UTF-8 encoding.
Chunked Files: Large datasets are split into small files (≤5 MB each) to facilitate manual transfer into secure computing environments. Files are numbered with zero-padded suffixes (e.g., _0000.csv, _0001.csv). The RegiStream autolabel module automatically merges these chunks when loading. If using the files manually, you will need to append them together in numerical order.
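For manual use, appending the chunks in numerical order could look like the sketch below. It is not the autolabel module's implementation. It assumes that every chunk starts with its own header row and that chunk files are named with a common stem followed by the numeric suffix; the stem "variables" and the function names are hypothetical.

```python
import csv
import re
from pathlib import Path

# Matches the zero-padded chunk suffix, e.g. "_0000.csv".
CHUNK_SUFFIX = re.compile(r"_(\d+)\.csv$")


def chunk_index(path):
    """Return the numeric chunk index encoded in a file name."""
    match = CHUNK_SUFFIX.search(path.name)
    if match is None:
        raise ValueError(f"{path.name} does not end in a numeric chunk suffix")
    return int(match.group(1))


def load_chunks(directory, stem):
    """Read all chunks for `stem` in numerical order and return the combined rows.

    Assumption: each chunk repeats the header row, so every file can be read
    independently with csv.DictReader.
    """
    candidates = Path(directory).glob(f"{stem}_*.csv")
    paths = sorted(
        (p for p in candidates if CHUNK_SUFFIX.search(p.name)),
        key=chunk_index,
    )
    rows = []
    for path in paths:
        with path.open(encoding="utf-8", newline="") as handle:
            rows.extend(csv.DictReader(handle, delimiter=";"))
    return rows
```

Calling `load_chunks("metadata", "variables")` would then read variables_0000.csv, variables_0001.csv, and so on from a hypothetical metadata directory. Sorting on the parsed integer rather than the file name keeps the order correct even if the suffix width ever varies.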
Available Datasets
RegiStream publishes curated metadata datasets organized by data source. Each dataset undergoes rigorous quality control, harmonization, and integrity checking before publication.
Expansion Roadmap: We are continuously expanding our coverage to include more statistical agencies and data sources. Each new dataset requires significant development work including scraping infrastructure, harmonization logic, and quality validation.
Variables
Variable files contain metadata describing dataset variables including variable codes, labels, detailed definitions, data types, and units of measurement.
Value Labels
Value label files provide human-readable descriptions for coded categorical values (variables with variable_type = categorical or binary).
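The link between the two file types runs through value_label_id. As a minimal sketch, assuming both files have already been read into lists of dicts with string values (as csv.DictReader would produce), the lookup could be done as follows. The function names and the sample rows are hypothetical.

```python
import json

# Only these variable types carry value labels under Schema 1.0.
LABELLED_TYPES = {"categorical", "binary"}


def build_label_index(value_label_rows):
    """Map each value_label_id to its parsed {code: label} dict."""
    return {
        row["value_label_id"]: json.loads(row["value_labels_json"])
        for row in value_label_rows
    }


def labels_for(variable_row, label_index):
    """Return the label dict for a variable, or None if it has no labels."""
    if variable_row["variable_type"] not in LABELLED_TYPES:
        return None
    return label_index.get(variable_row.get("value_label_id"))


# Illustrative rows; IDs are strings because they come from CSV cells.
variables = [
    {"variable_name": "kon", "variable_type": "categorical", "value_label_id": "1"},
    {"variable_name": "inkomst", "variable_type": "continuous", "value_label_id": ""},
]
value_labels = [
    {
        "value_label_id": "1",
        "variable_name": "kon",
        "value_labels_json": '{"K": "Woman", "M": "Man"}',
    },
]

index = build_label_index(value_labels)
for variable in variables:
    print(variable["variable_name"], labels_for(variable, index))
```

With the index in place, decoding a raw value is a dict lookup, for example `labels_for(variables[0], index)["K"]`.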
Languages
Datasets are available in multiple languages depending on the data source. RegiStream provides metadata in both the original language and English translations where applicable.
Language availability varies by dataset domain. Use the download modal to see which languages are available for each dataset.
Language Codes: We use ISO 639-2 three-letter codes (e.g., eng for English, swe for Swedish).
Downloads
Download RegiStream's published datasets, organized by data source. All datasets are professionally curated, version-controlled, and thoroughly quality-checked.
Currently Published Datasets
| Domain | Description | Languages |
|---|---|---|
| Statistics Sweden (SCB) | Curated and harmonized metadata from Swedish administrative registers. Includes standardized variable descriptions, value labels, and multilingual support. | English, Swedish |
Note: The "Latest" version is automatically updated when new releases are available. For reproducible research, we recommend using a specific dated version.
Getting Started
RegiStream datasets can be accessed through platform-specific packages that handle downloading, caching, and loading metadata automatically.
Choose Your Platform
Stata
Install the RegiStream package for Stata to automatically download and apply variable labels and value labels to your datasets.
Stata Documentation
Python
Use the RegiStream Python package to access dataset metadata programmatically in your data analysis workflows.
Python Documentation