Overview
RegiStream allows you to create custom metadata datasets for your own research data. Custom datasets follow the same Schema 1.0 format as official datasets, ensuring compatibility with all RegiStream features.
Why Create Custom Datasets?
- Add metadata for proprietary or institution-specific datasets
- Create variable labels and value labels in your preferred language
- Maintain consistent metadata across your research projects
- Share metadata with collaborators without sharing the actual data
Note: Custom datasets are stored locally in your RegiStream directory and are not shared with the RegiStream server, so you can keep metadata for sensitive or proprietary data private.
Schema Requirements
All custom datasets must follow the RegiStream schema specifications. This guide covers Schema 1.0; see the complete Schema 1.0 specification for a detailed reference.
File Requirements
- Format: Semicolon-delimited CSV (`;`)
- Encoding: UTF-8
- Headers: First row must contain column names
- Schema Version: Must be Schema 1.0 compliant
Required Files
You can create either or both file types:
- Variables file: Contains variable metadata (names, labels, definitions, types, units)
- Value labels file: Contains categorical value labels (for categorical and binary variables)
Step-by-Step Guide
1. Create Variables CSV
Create a CSV file with the following required columns (semicolon-delimited):
variable_name;variable_label;variable_definition;variable_unit;variable_type;value_label_id
age;Age;Age of respondent in years;years;continuous;
gender;Gender;Gender of respondent;;categorical;1
income;Annual Income;Total annual income;SEK;continuous;
education;Education Level;Highest level of education completed;;categorical;2
Column Descriptions:
- `variable_name`: Lowercase variable identifier (required)
- `variable_label`: Short label (required)
- `variable_definition`: Detailed description (required)
- `variable_unit`: Unit of measurement (optional, leave empty if not applicable)
- `variable_type`: One of `categorical`, `continuous`, `text`, `date`, `binary` (required)
- `value_label_id`: Integer linking to the value labels file (optional, for categorical/binary only)
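A variables file can also be generated programmatically, which avoids miscounting semicolons when a field is empty. The following is a minimal sketch using Python's standard `csv` module; the helper name `variables_csv` is hypothetical and not part of RegiStream.

```python
import csv
import io

# Column order taken from the header row shown above.
COLUMNS = ["variable_name", "variable_label", "variable_definition",
           "variable_unit", "variable_type", "value_label_id"]

def variables_csv(rows):
    """Return semicolon-delimited CSV text for a list of row dicts.

    Keys missing from a row are written as empty fields.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter=";",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {"variable_name": "age", "variable_label": "Age",
     "variable_definition": "Age of respondent in years",
     "variable_unit": "years", "variable_type": "continuous"},
    {"variable_name": "gender", "variable_label": "Gender",
     "variable_definition": "Gender of respondent",
     "variable_type": "categorical", "value_label_id": 1},
]
text = variables_csv(rows)
print(text)
```

To save the result, write `text` to a file opened with `encoding="utf-8"` and `newline=""`.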
2. Create Value Labels CSV
Create a CSV file with value labels for categorical variables:
value_label_id;variable_name;value_labels_json;value_labels_stata;conflict;harmonized_automatically
1;gender;"{""1"": ""Male"", ""2"": ""Female"", ""3"": ""Other""}";"""1"" ""Male"" ""2"" ""Female"" ""3"" ""Other""";0;0
2;education;"{""1"": ""Elementary"", ""2"": ""High School"", ""3"": ""University""}";"""1"" ""Elementary"" ""2"" ""High School"" ""3"" ""University""";0;0
Column Descriptions:
- `value_label_id`: Unique integer identifier (matches the variables file)
- `variable_name`: Variable this label set applies to
- `value_labels_json`: JSON format labels (double-quoted)
- `value_labels_stata`: Stata-optimized format (space-separated quoted pairs)
- `conflict`: Harmonization conflict flag (0 or 1, optional)
- `harmonized_automatically`: Auto-harmonization flag (0 or 1, optional)
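Important: Both `value_labels_json` and `value_labels_stata` are required and must contain the same label mappings. Because both fields contain double quotes, each field is wrapped in double quotes in the CSV and every double quote inside it is escaped as `""`.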
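Writing the doubled quotes by hand is error-prone, so it can help to build both representations from a single mapping and let a CSV writer handle the escaping. The sketch below assumes labels do not themselves contain double quotes; `label_row` is a hypothetical helper, not a RegiStream function.

```python
import csv
import io
import json

HEADER = ["value_label_id", "variable_name", "value_labels_json",
          "value_labels_stata", "conflict", "harmonized_automatically"]

def label_row(label_id, variable_name, labels):
    """Build one value-labels row from a {code: label} mapping.

    Both formats come from the same mapping, so they cannot drift apart.
    Assumes no label contains a double quote.
    """
    as_json = json.dumps({str(k): v for k, v in labels.items()},
                         ensure_ascii=False)
    as_stata = " ".join(f'"{k}" "{v}"' for k, v in labels.items())
    return [label_id, variable_name, as_json, as_stata, 0, 0]

buf = io.StringIO()
# The csv module wraps fields containing quotes and doubles the inner quotes.
writer = csv.writer(buf, delimiter=";", lineterminator="\n")
writer.writerow(HEADER)
writer.writerow(label_row(1, "gender", {1: "Male", 2: "Female", 3: "Other"}))
print(buf.getvalue())
```

The second line printed should match the `gender` row in the example above.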
3. File Naming Convention
Name your CSV files following this pattern:
{domain}_{type}_{language}.csv
Examples:
mydata_variables_eng.csv
mydata_value_labels_eng.csv
hospital_variables_swe.csv
survey2024_variables_eng.csv
Naming Components:
- `domain`: Your dataset identifier (lowercase, no spaces)
- `type`: Either `variables` or `value_labels`
- `language`: Language code (e.g., `eng`, `swe`, `fra`)
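A filename can be checked against this pattern before it is copied into place. The regular expression below is an assumption derived from the examples: it allows only lowercase letters and digits in the domain and expects a three-letter language code, which may be stricter than what RegiStream actually accepts.

```python
import re

# Assumed pattern: {domain}_{type}_{language}.csv
# - domain: lowercase letters and digits (assumption; underscores are not allowed here)
# - type: "variables" or "value_labels"
# - language: three lowercase letters (assumption based on eng, swe, fra)
FILENAME_RE = re.compile(
    r"^(?P<domain>[a-z0-9]+)_(?P<type>variables|value_labels)_(?P<language>[a-z]{3})\.csv$"
)

def parse_key_filename(name):
    """Return the filename components as a dict, or None if the name does not fit."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None

print(parse_key_filename("survey2024_value_labels_eng.csv"))
print(parse_key_filename("My Data_variables_eng.csv"))
```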
4. Directory Structure
Place your CSV files directly in the RegiStream autolabel keys directory:
# On Mac/Linux:
~/.registream/autolabel_keys/
# On Windows:
%USERPROFILE%\AppData\Local\registream\autolabel_keys\
# Example structure:
~/.registream/autolabel_keys/
├── mydata_variables_eng.csv
├── mydata_value_labels_eng.csv
├── survey2024_variables_eng.csv
└── survey2024_value_labels_eng.csv
Note: CSV files are placed directly in the autolabel_keys directory. RegiStream will automatically detect and process them based on their filename.
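If you script the copying of files, the target directory can be resolved per platform. This is a sketch based only on the two paths listed above; `autolabel_keys_dir` is a hypothetical helper, and RegiStream may support other locations that this does not cover.

```python
import os
import sys
from pathlib import Path

def autolabel_keys_dir(platform=sys.platform, env=os.environ):
    """Return the autolabel_keys directory described in this guide.

    Windows: %USERPROFILE%\\AppData\\Local\\registream\\autolabel_keys
    Mac/Linux: ~/.registream/autolabel_keys
    """
    if platform.startswith("win"):
        base = Path(env["USERPROFILE"]) / "AppData" / "Local" / "registream"
    else:
        base = Path.home() / ".registream"
    return base / "autolabel_keys"

keys_dir = autolabel_keys_dir()
print(keys_dir)
# List any CSV files already present (empty if the directory does not exist).
print(sorted(p.name for p in keys_dir.glob("*.csv")))
```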
Complete Example
Let's create a complete custom dataset for a fictional survey:
Step 1: Create Variables CSV
File: ~/.registream/autolabel_keys/survey2024_variables_eng.csv
variable_name;variable_label;variable_definition;variable_unit;variable_type;value_label_id
id;Respondent ID;Unique identifier for each respondent;;text;
age;Age;Age of respondent in years;years;continuous;
gender;Gender;Self-identified gender;;categorical;1
employed;Employment Status;Currently employed (yes/no);;binary;2
satisfaction;Job Satisfaction;Level of job satisfaction (1-5);;categorical;3
income;Annual Income;Total household income;USD;continuous;
Step 2: Create Value Labels CSV
File: ~/.registream/autolabel_keys/survey2024_value_labels_eng.csv
value_label_id;variable_name;value_labels_json;value_labels_stata;conflict;harmonized_automatically
1;gender;"{""1"": ""Male"", ""2"": ""Female"", ""3"": ""Non-binary"", ""9"": ""Prefer not to say""}";"""1"" ""Male"" ""2"" ""Female"" ""3"" ""Non-binary"" ""9"" ""Prefer not to say""";0;0
2;employed;"{""0"": ""No"", ""1"": ""Yes""}";"""0"" ""No"" ""1"" ""Yes""";0;0
3;satisfaction;"{""1"": ""Very dissatisfied"", ""2"": ""Dissatisfied"", ""3"": ""Neutral"", ""4"": ""Satisfied"", ""5"": ""Very satisfied""}";"""1"" ""Very dissatisfied"" ""2"" ""Dissatisfied"" ""3"" ""Neutral"" ""4"" ""Satisfied"" ""5"" ""Very satisfied""";0;0
Step 3: Use in Stata
* Load your custom dataset metadata
autolabel variables, domain(survey2024) lang(eng)
autolabel values, domain(survey2024) lang(eng)
* Now your variables have labels!
describe
tab gender
tab satisfaction
Testing Your Dataset
After creating your CSV files, test that they work correctly:
1. Verify File Format
# Check encoding (should be UTF-8 or us-ascii)
# -I is the macOS flag; on most Linux systems use -i instead
file -I ~/.registream/autolabel_keys/mydata_variables_eng.csv
# Check delimiter (should show semicolons)
head -1 ~/.registream/autolabel_keys/mydata_variables_eng.csv
2. Test in Stata
* Try loading your custom metadata
autolabel variables, domain(mydata) lang(eng)
* Check for errors
* If successful, you should see:
* "Variables metadata loaded successfully"
* Verify the labels were applied
describe
labelbook
3. Check for Common Issues
- File encoding is UTF-8 (not UTF-16 or Windows-1252)
- Delimiter is semicolon (`;`), not comma
- Column names match exactly (case-sensitive)
- No missing required columns
- JSON in value labels is properly escaped
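Several of these checks can be automated. The sketch below inspects the raw bytes of a file and reports problems from the list above. The required-column lists reflect one reading of this guide (all six columns for a variables file, the first four for a value labels file) and are an assumption, as is the helper name `check_csv`.

```python
import csv
import io
import json

# Assumed required headers, based on the column descriptions in this guide.
REQUIRED = {
    "variables": ["variable_name", "variable_label", "variable_definition",
                  "variable_unit", "variable_type", "value_label_id"],
    "value_labels": ["value_label_id", "variable_name",
                     "value_labels_json", "value_labels_stata"],
}

def check_csv(raw, file_type):
    """Return a list of problems found in the raw bytes of a metadata CSV."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    problems = []
    if text.startswith("\ufeff"):
        problems.append("file starts with a UTF-8 BOM")
        text = text[1:]
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=";")
    header = reader.fieldnames or []
    if len(header) == 1 and "," in header[0]:
        problems.append("header looks comma-delimited, expected ';'")
    for col in REQUIRED[file_type]:
        if col not in header:  # exact, case-sensitive match
            problems.append(f"missing column: {col}")
    if "value_labels_json" in header:
        for row_no, row in enumerate(reader, start=2):
            try:
                json.loads(row["value_labels_json"] or "")
            except json.JSONDecodeError:
                problems.append(f"row {row_no}: value_labels_json is not valid JSON")
    return problems

print(check_csv(b"variable_name,variable_label\nage,Age\n", "variables"))
```

To check a file on disk, pass `Path(...).read_bytes()` as `raw`.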
Troubleshooting
File Not Found Error
Problem: RegiStream can't find your CSV file.
Solution:
- Check the file is in the correct directory: `~/.registream/autolabel_keys/`
- Check the filename matches exactly: `{domain}_{type}_{language}.csv`
- On Mac/Linux, verify `~` expands correctly (use the full path if needed)
- The file should be directly in `autolabel_keys`, not in a subfolder
Schema Validation Failed
Problem: "Schema validation failed: Required column 'variable_label' not found!"
Solution:
- Verify all required columns are present
- Check column names are spelled correctly (case-sensitive)
- Ensure first row contains headers, not data
Encoding Issues
Problem: Special characters appear garbled (e.g., "café" shows as "cafÃ©")
Solution:
- Save CSV as UTF-8 encoding (not UTF-8 with BOM)
- In Excel: Save As → CSV UTF-8 (not regular CSV)
- In text editors: Set encoding to UTF-8 before saving
Value Labels Not Working
Problem: Variables show codes instead of labels
Solution:
- Check that `value_label_id` in the variables file matches `value_label_id` in the value labels file
- Verify both `value_labels_json` and `value_labels_stata` are present and contain the same label mappings
- Ensure JSON is properly escaped (use `""` for quotes inside CSV)
Getting Help
If you continue experiencing issues:
- Check the GitHub Issues for similar problems
- Share your CSV file structure (first few rows) when reporting issues
- Include the exact error message from Stata/Python