Custom Datasets

Overview

RegiStream allows you to create custom metadata datasets for your own research data. Custom datasets follow the same Schema 1.0 format as official datasets, ensuring compatibility with all RegiStream features.

Why Create Custom Datasets?

  • Add metadata for proprietary or institution-specific datasets
  • Create variable labels and value labels in your preferred language
  • Maintain consistent metadata across your research projects
  • Share metadata with collaborators without sharing the actual data

Note: Custom datasets are stored locally in your RegiStream directory and are not shared with the RegiStream server. This lets you keep metadata for sensitive or proprietary data private.

Schema Requirements

All custom datasets must follow the RegiStream schema specifications. This guide covers Schema 1.0; see the complete Schema 1.0 specification for detailed reference.

File Requirements

  • Format: Semicolon-delimited CSV (`;`)
  • Encoding: UTF-8
  • Headers: First row must contain column names
  • Schema Version: Must be Schema 1.0 compliant

Required Files

You can create either or both file types:

  • Variables file: Contains variable metadata (names, labels, definitions, types, units)
  • Value labels file: Contains categorical value labels (for categorical and binary variables)

Step-by-Step Guide

1. Create Variables CSV

Create a CSV file with the following required columns (semicolon-delimited):

variable_name;variable_label;variable_definition;variable_unit;variable_type;value_label_id
age;Age;Age of respondent in years;years;continuous;
gender;Gender;Gender of respondent;;categorical;1
income;Annual Income;Total annual income;SEK;continuous;
education;Education Level;Highest level of education completed;;categorical;2

Column Descriptions:

  • variable_name - Lowercase variable identifier (required)
  • variable_label - Short label (required)
  • variable_definition - Detailed description (required)
  • variable_unit - Unit of measurement (optional, leave empty if not applicable)
  • variable_type - One of: categorical, continuous, text, date, binary (required)
  • value_label_id - Integer linking to value labels file (optional, for categorical/binary only)
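The column rules above can be checked mechanically before a file is handed to RegiStream. The following is a rough sketch, not RegiStream's own validation: the function name validate_variable_row is hypothetical, and treating an integer as a string of digits is a simplifying assumption.

```python
import csv
import io

ALLOWED_TYPES = {"categorical", "continuous", "text", "date", "binary"}
REQUIRED_FIELDS = ("variable_name", "variable_label",
                   "variable_definition", "variable_type")


def validate_variable_row(row):
    """Return a list of problems found in one variables-file row (a dict)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"missing required field: {field}")
    name = row.get("variable_name") or ""
    if name != name.lower():
        problems.append(f"variable_name is not lowercase: {name}")
    var_type = row.get("variable_type") or ""
    if var_type and var_type not in ALLOWED_TYPES:
        problems.append(f"unknown variable_type: {var_type}")
    label_id = (row.get("value_label_id") or "").strip()
    if label_id:
        # Assumption: a valid id is a non-negative integer written in digits.
        if not label_id.isdigit():
            problems.append(f"value_label_id is not an integer: {label_id}")
        if var_type not in {"categorical", "binary"}:
            problems.append("value_label_id set on a non-categorical variable")
    return problems


sample = (
    "variable_name;variable_label;variable_definition;variable_unit;variable_type;value_label_id\n"
    "age;Age;Age of respondent in years;years;continuous;\n"
    "gender;Gender;Gender of respondent;categorical;1\n"  # one field short
)
for row in csv.DictReader(io.StringIO(sample), delimiter=";"):
    print(row["variable_name"], validate_variable_row(row))
```

The second sample row deliberately omits the empty variable_unit field, so every later value shifts one column to the left and the check reports an unknown variable_type. Dropping an empty field is an easy mistake to make when editing these files by hand.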

2. Create Value Labels CSV

Create a CSV file with value labels for categorical variables:

value_label_id;variable_name;value_labels_json;value_labels_stata;conflict;harmonized_automatically
1;gender;"{""1"": ""Male"", ""2"": ""Female"", ""3"": ""Other""}";"""1"" ""Male"" ""2"" ""Female"" ""3"" ""Other""";0;0
2;education;"{""1"": ""Elementary"", ""2"": ""High School"", ""3"": ""University""}";"""1"" ""Elementary"" ""2"" ""High School"" ""3"" ""University""";0;0

Column Descriptions:

  • value_label_id - Unique integer identifier (matches variable file)
  • variable_name - Variable this label set applies to
  • value_labels_json - JSON format labels (double-quoted)
  • value_labels_stata - Stata-optimized format (space-separated quoted pairs)
  • conflict - Harmonization conflict flag (0 or 1, optional)
  • harmonized_automatically - Auto-harmonization flag (0 or 1, optional)

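Important: Both value_labels_json and value_labels_stata are required and must contain the same label mappings. Because both fields contain double quotes, each field is wrapped in double quotes in the CSV and every double quote inside it is escaped by doubling it ("").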
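Writing the doubled quotes by hand is error-prone, so it can help to generate both fields from a single mapping and let a CSV writer handle the escaping. This is a minimal sketch under stated assumptions: build_label_fields is a hypothetical helper, and it assumes labels do not themselves contain double quotes.

```python
import csv
import io
import json


def build_label_fields(mapping):
    """Return (value_labels_json, value_labels_stata) for one label set."""
    # Codes are written as strings, matching the examples in this guide.
    as_strings = {str(code): str(label) for code, label in mapping.items()}
    json_text = json.dumps(as_strings, ensure_ascii=False)
    # Stata-style format: space-separated pairs of quoted code and label.
    stata_text = " ".join(
        f'"{code}" "{label}"' for code, label in as_strings.items()
    )
    return json_text, stata_text


json_text, stata_text = build_label_fields({1: "Male", 2: "Female", 3: "Other"})

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(["value_label_id", "variable_name", "value_labels_json",
                 "value_labels_stata", "conflict", "harmonized_automatically"])
# csv.writer wraps fields that contain quotes and doubles the inner quotes.
writer.writerow([1, "gender", json_text, stata_text, 0, 0])
print(buffer.getvalue())
```

Because both strings are derived from the same dictionary, they cannot drift apart, which removes one common cause of value labels failing to apply.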

3. File Naming Convention

Name your CSV files following this pattern:

{domain}_{type}_{language}.csv

Examples:
mydata_variables_eng.csv
mydata_value_labels_eng.csv
hospital_variables_swe.csv
survey2024_variables_eng.csv

Naming Components:

  • domain - Your dataset identifier (lowercase, no spaces)
  • type - Either variables or value_labels
  • language - Language code (e.g., eng, swe, fra)
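The naming pattern can be expressed as a regular expression to build or check file names. The sketch below makes two assumptions that this guide does not state: the domain consists only of lowercase letters and digits, and the language code consists only of lowercase letters. The function names are hypothetical.

```python
import re

# Assumed character sets: adjust if your domain names need other characters.
FILENAME_PATTERN = re.compile(
    r"^(?P<domain>[a-z0-9]+)_(?P<type>variables|value_labels)_(?P<language>[a-z]+)\.csv$"
)


def build_filename(domain, file_type, language):
    """Return '{domain}_{type}_{language}.csv' or raise if it breaks the pattern."""
    name = f"{domain}_{file_type}_{language}.csv"
    if not FILENAME_PATTERN.match(name):
        raise ValueError(f"file name does not follow the convention: {name}")
    return name


def parse_filename(name):
    """Return the naming components as a dict, or None if the name is invalid."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None


print(build_filename("survey2024", "value_labels", "eng"))
print(parse_filename("hospital_variables_swe.csv"))
print(parse_filename("My Data_variables_eng.csv"))
```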

4. Directory Structure

Place your CSV files directly in the RegiStream autolabel keys directory:

# On Mac/Linux:
~/.registream/autolabel_keys/

# On Windows:
%USERPROFILE%\AppData\Local\registream\autolabel_keys\

# Example structure:
~/.registream/autolabel_keys/
├── mydata_variables_eng.csv
├── mydata_value_labels_eng.csv
├── survey2024_variables_eng.csv
└── survey2024_value_labels_eng.csv

Note: CSV files are placed directly in the autolabel_keys directory. RegiStream will automatically detect and process them based on their filename.
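To confirm where the files should go and what is already there, a small script can resolve the directory for the current platform and list its CSV files. This is only a sketch built from the two paths shown above; the helper names are hypothetical, and RegiStream itself may resolve the directory differently.

```python
import os
import sys
from pathlib import Path


def autolabel_keys_dir(platform=None):
    """Return the autolabel_keys directory described in this guide."""
    platform = platform or sys.platform
    if platform.startswith("win"):
        # Mirrors %USERPROFILE%\AppData\Local\registream\autolabel_keys\
        user_profile = os.environ.get("USERPROFILE", str(Path.home()))
        return Path(user_profile) / "AppData" / "Local" / "registream" / "autolabel_keys"
    # Mirrors ~/.registream/autolabel_keys/ on Mac/Linux.
    return Path.home() / ".registream" / "autolabel_keys"


def list_metadata_files(directory):
    """List CSV files placed directly in the directory (subfolders are ignored)."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(path.name for path in directory.glob("*.csv"))


keys_dir = autolabel_keys_dir()
print(keys_dir)
print(list_metadata_files(keys_dir))
```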

Complete Example

Let's create a complete custom dataset for a fictional survey:

Step 1: Create Variables CSV

File: ~/.registream/autolabel_keys/survey2024_variables_eng.csv

variable_name;variable_label;variable_definition;variable_unit;variable_type;value_label_id
id;Respondent ID;Unique identifier for each respondent;;text;
age;Age;Age of respondent in years;years;continuous;
gender;Gender;Self-identified gender;;categorical;1
employed;Employment Status;Currently employed (yes/no);;binary;2
satisfaction;Job Satisfaction;Level of job satisfaction (1-5);;categorical;3
income;Annual Income;Total household income;USD;continuous;

Step 2: Create Value Labels CSV

File: ~/.registream/autolabel_keys/survey2024_value_labels_eng.csv

value_label_id;variable_name;value_labels_json;value_labels_stata;conflict;harmonized_automatically
1;gender;"{""1"": ""Male"", ""2"": ""Female"", ""3"": ""Non-binary"", ""9"": ""Prefer not to say""}";"""1"" ""Male"" ""2"" ""Female"" ""3"" ""Non-binary"" ""9"" ""Prefer not to say""";0;0
2;employed;"{""0"": ""No"", ""1"": ""Yes""}";"""0"" ""No"" ""1"" ""Yes""";0;0
3;satisfaction;"{""1"": ""Very dissatisfied"", ""2"": ""Dissatisfied"", ""3"": ""Neutral"", ""4"": ""Satisfied"", ""5"": ""Very satisfied""}";"""1"" ""Very dissatisfied"" ""2"" ""Dissatisfied"" ""3"" ""Neutral"" ""4"" ""Satisfied"" ""5"" ""Very satisfied""";0;0

Step 3: Use in Stata

* Load your custom dataset metadata
autolabel variables, domain(survey2024) lang(eng)
autolabel values, domain(survey2024) lang(eng)

* Now your variables have labels!
describe
tab gender
tab satisfaction

Testing Your Dataset

After creating your CSV files, test that they work correctly:

1. Verify File Format

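# Check encoding (should report utf-8; a pure-ASCII file reports us-ascii, which is also valid UTF-8)
# The flag below is for macOS; on Linux use lowercase: file -i
file -I ~/.registream/autolabel_keys/mydata_variables_eng.csv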

# Check delimiter (should show semicolons)
head -1 ~/.registream/autolabel_keys/mydata_variables_eng.csv

2. Test in Stata

* Try loading your custom metadata
autolabel variables, domain(mydata) lang(eng)

* Check for errors
* If successful, you should see:
* "Variables metadata loaded successfully"

* Verify the labels were applied
describe
labelbook

3. Check for Common Issues

  • File encoding is UTF-8 (not UTF-16 or Windows-1252)
  • Delimiter is semicolon (`;`), not comma
  • Column names match exactly (case-sensitive)
  • No missing required columns
  • JSON in value labels is properly escaped
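Several of these checks can be automated. The sketch below covers encoding, delimiter, and header names; it is not RegiStream's validator, the function name check_csv_file is hypothetical, and it assumes that all six columns of the variables file must appear in the header.

```python
import csv
import tempfile
from pathlib import Path

VARIABLES_HEADER = ["variable_name", "variable_label", "variable_definition",
                    "variable_unit", "variable_type", "value_label_id"]


def check_csv_file(path, expected_header):
    """Return a list of problems with encoding, delimiter, and header names."""
    problems = []
    raw = Path(path).read_bytes()
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte-order mark")
    try:
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    if ";" not in first_line:
        problems.append("header row has no semicolons; wrong delimiter?")
    header = next(csv.reader([first_line], delimiter=";"), [])
    # Comparison is exact, so differences in case count as missing.
    missing = [name for name in expected_header if name not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    return problems


# Demo: a comma-delimited file triggers the delimiter and header checks.
demo_dir = Path(tempfile.mkdtemp())
bad_file = demo_dir / "mydata_variables_eng.csv"
bad_file.write_text("variable_name,variable_label\nage,Age\n", encoding="utf-8")
for problem in check_csv_file(bad_file, VARIABLES_HEADER):
    print(problem)
```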

Troubleshooting

File Not Found Error

Problem: RegiStream can't find your CSV file.

Solution:

  • Check file is in correct directory: ~/.registream/autolabel_keys/
  • Check filename matches exactly: {domain}_{type}_{language}.csv
  • On Mac/Linux, verify ~ expands correctly (use full path if needed)
  • File should be directly in autolabel_keys, not in a subfolder

Schema Validation Failed

Problem: "Schema validation failed: Required column 'variable_label' not found!"

Solution:

  • Verify all required columns are present
  • Check column names are spelled correctly (case-sensitive)
  • Ensure first row contains headers, not data

Encoding Issues

Problem: Special characters appear garbled (e.g., "café" shows as "cafÃ©")

Solution:

  • Save CSV as UTF-8 encoding (not UTF-8 with BOM)
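  • In Excel: Save As → CSV UTF-8 (not regular CSV). Excel's CSV UTF-8 export writes a BOM and may use commas depending on regional settings, so re-check the encoding and delimiter afterwards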
  • In text editors: Set encoding to UTF-8 before saving
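The garbling described here typically arises when UTF-8 bytes are read as a legacy Windows encoding. The sketch below reproduces the effect and shows one possible way to normalize a file's bytes to UTF-8 without a BOM. The function name is hypothetical, and the fallback to Windows-1252 is an assumption: if the file was saved in some other legacy encoding, pass that encoding instead.

```python
def to_utf8_without_bom(raw, fallback_encoding="cp1252"):
    """Return the given bytes re-encoded as UTF-8 without a byte-order mark."""
    try:
        # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is a legacy Windows-1252 file.
        text = raw.decode(fallback_encoding)
    return text.encode("utf-8")


# How the garbling arises: UTF-8 bytes read back as Windows-1252.
garbled = "café".encode("utf-8").decode("cp1252")
print(garbled)

# A file saved with a BOM, and one saved as Windows-1252, both normalize
# to the same plain UTF-8 bytes.
with_bom = b"\xef\xbb\xbf" + "café".encode("utf-8")
legacy = "café".encode("cp1252")
print(to_utf8_without_bom(with_bom) == to_utf8_without_bom(legacy))
```

To repair a file on disk, read it with Path.read_bytes(), pass the bytes through the function, and write the result back with Path.write_bytes(); keep a backup copy first.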

Value Labels Not Working

Problem: Variables show codes instead of labels

Solution:

  • Check value_label_id in variables file matches value_label_id in value labels file
  • Verify both value_labels_json and value_labels_stata are present and contain the same code-to-label mappings
  • Ensure JSON is properly escaped (use "" for quotes inside CSV)
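The first two checks can be scripted once both files have been read into rows, for example with csv.DictReader and delimiter=";". The sketch below is an illustration rather than RegiStream's own logic: the function names are hypothetical, and parsing the Stata-style string with shlex assumes labels contain no unusual quoting.

```python
import json
import shlex


def parse_stata_labels(text):
    """Turn '"1" "Male" "2" "Female"' into {'1': 'Male', '2': 'Female'}."""
    tokens = shlex.split(text)
    return dict(zip(tokens[0::2], tokens[1::2]))


def find_label_problems(variables_rows, label_rows):
    """Cross-check value_label_id links and JSON/Stata label consistency."""
    problems = []
    label_ids = {row["value_label_id"] for row in label_rows}
    for row in variables_rows:
        label_id = row.get("value_label_id") or ""
        if label_id and label_id not in label_ids:
            problems.append(
                f"{row['variable_name']}: value_label_id {label_id} "
                "has no match in the value labels file"
            )
    for row in label_rows:
        try:
            from_json = json.loads(row["value_labels_json"])
            from_stata = parse_stata_labels(row["value_labels_stata"])
        except ValueError:  # covers invalid JSON and unbalanced quotes
            problems.append(f"label set {row['value_label_id']}: could not parse labels")
            continue
        if from_json != from_stata:
            problems.append(f"label set {row['value_label_id']}: JSON and Stata labels differ")
    return problems


# Rows as csv.DictReader would return them (only the relevant columns shown).
variables_rows = [
    {"variable_name": "gender", "value_label_id": "1"},
    {"variable_name": "education", "value_label_id": "2"},
    {"variable_name": "age", "value_label_id": ""},
]
label_rows = [
    {"value_label_id": "1",
     "value_labels_json": '{"1": "Male", "2": "Female"}',
     "value_labels_stata": '"1" "Male" "2" "Female"'},
]
for problem in find_label_problems(variables_rows, label_rows):
    print(problem)
```

In this sample, education points to label set 2, which is absent from the value labels rows, so it is the only problem reported.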

Getting Help

If you continue experiencing issues:

  • Check the GitHub Issues for similar problems
  • Share your CSV file structure (first few rows) when reporting issues
  • Include the exact error message from Stata/Python