Philosophy
RegiStream for Python is designed to integrate seamlessly with pandas, making the labeling process as intuitive as possible.
We use pandas accessors to provide a natural, pandas-native experience. By simply adding .lab to your DataFrame
or Series, you gain access to labeled versions of your data. This approach means you can continue using all the pandas
functionality you already know, with the added benefit of proper variable and value labels.
Whether you're previewing data, creating tables, plotting, or running regressions, RegiStream enhances your workflow
without changing it - you just add .lab to bring your labels into the picture.
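To make the accessor idea concrete, the sketch below registers a toy DataFrame accessor through pandas' public extension API. It is not RegiStream's implementation: the accessor name `demo_lab`, the use of `DataFrame.attrs` to hold labels, and the label texts are all assumptions made for illustration.

```python
import pandas as pd


# Hypothetical accessor name, chosen so it cannot clash with RegiStream's own `.lab`.
@pd.api.extensions.register_dataframe_accessor("demo_lab")
class DemoLabelAccessor:
    """Toy accessor that exposes a relabeled view of a DataFrame."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def head(self, n: int = 5) -> pd.DataFrame:
        # Assumption for this sketch: variable labels are kept in DataFrame.attrs.
        labels = self._df.attrs.get("variable_labels", {})
        return self._df.head(n).rename(columns=labels)


df = pd.DataFrame({"kon": [1, 2], "examar": [2001, 2003]})
# Made-up label texts, not taken from any real metadata file.
df.attrs["variable_labels"] = {"kon": "Sex", "examar": "Graduation year"}

labeled = df.demo_lab.head()
print(list(labeled.columns))  # labeled view
print(list(df.columns))       # the original DataFrame is left unchanged
```

The point of the pattern is that the labeled view is derived on demand, so the underlying DataFrame keeps its original column names and stays compatible with existing pandas code.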
Installation
RegiStream for Python can be installed via pip or manually for offline environments.
Pip Install (Recommended)
The quickest way to install RegiStream is directly from PyPI using pip:
pip install registream
This command will download and install the latest version of RegiStream (currently v1.0.1), including all necessary dependencies.
Offline Installation for Secure Environments
If you are working on an offline server or a high-security environment (e.g., MONA), you can manually download the RegiStream wheel file and transfer it to the secure system.
Follow these steps:
- Download the wheel file (*.whl) from PyPI on your local system
- Transfer the wheel file to your secure server system
- On the secure system, install the package from the local wheel file:
pip install /path/to/registream-1.0.1-py3-none-any.whl
Note: For information on downloading and setting up metadata files in secure environments, see the Datasets Guide.
Verification
To verify the installation was successful, run:
import registream
registream.__version__
# Expected output: '1.0.1'
This should display the installed version.
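If you would rather check for the package without importing it (for example in a setup script that must not fail when RegiStream is missing), a minimal sketch using only the standard library could look like this. The helper name `installed_version` is hypothetical; `importlib.metadata` reads the version recorded by pip for the distribution name used in `pip install registream`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("registream")
if found is None:
    print("registream is not installed in this environment")
else:
    print(f"registream {found} is installed")
```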
Uninstalling
To uninstall RegiStream:
pip uninstall registream
Updating RegiStream
To update RegiStream to the latest version, use pip's upgrade option:
# Update to latest version
pip install --upgrade registream
# Or check for updates first
pip list --outdated | grep registream
Note: Updating the package will not affect your cached metadata files. Metadata is stored
separately in ~/.registream/autolabel_keys/ and persists across package updates.
Checking Current Version
To see which version you have installed:
import registream
print(registream.__version__)
Quick Start
Get started with RegiStream in just a few lines of code!
import pandas as pd
import registream
# Load your data
df = pd.read_stata('path/to/your/data.dta')
# Apply variable labels
df.autolabel(domain='scb', lang='eng')
# Apply value labels
df.autolabel(label_type='values', domain='scb', lang='eng')
# View labeled data
df.lab.head()
That's it! Your DataFrame now has variable and value labels that you can access using the .lab accessor.
Usage Guide
Applying Labels to DataFrames
The autolabel() method applies metadata labels to your pandas DataFrame:
# Apply variable labels (labels for column names)
df.autolabel(domain='scb', lang='eng')
# Apply value labels (labels for categorical values)
df.autolabel(label_type='values', domain='scb', lang='eng')
# Apply both (one call per label type)
df.autolabel(domain='scb', lang='eng')
df.autolabel(label_type='values', domain='scb', lang='eng')
Viewing Labeled Data
Use the .lab accessor to view your data with labels:
# Preview with variable labels (column names are labeled)
df.lab.head()
# Preview with both variable and value labels
df.lab.show_values().head()
# View specific columns with labels
cols_to_show = ['astsni2007', 'ssyk3', 'astsni2002']
df[cols_to_show].lab.head()
What's the difference?
- df.lab.head() - Shows data with variable labels as column names
- df.lab.show_values().head() - Shows data with both variable labels AND value labels for categorical data
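The distinction between the two label types can be illustrated in plain pandas, without RegiStream. The data and label texts below are invented for the example; real labels come from the downloaded metadata, and this sketch does not claim to reproduce what the .lab accessor does internally.

```python
import pandas as pd

# Illustrative data and labels only (assumptions, not real SCB metadata).
df = pd.DataFrame({"kon": [1, 2, 1]})
variable_labels = {"kon": "Sex"}
value_labels = {"kon": {1: "Man", 2: "Woman"}}

# Variable labels: only the column name changes, the codes stay as stored.
with_variable_labels = df.rename(columns=variable_labels)

# Value labels: the codes are also replaced by their text labels.
with_both = df.copy()
for column, mapping in value_labels.items():
    with_both[column] = with_both[column].map(mapping)
with_both = with_both.rename(columns=variable_labels)

print(with_variable_labels["Sex"].tolist())  # [1, 2, 1]
print(with_both["Sex"].tolist())             # ['Man', 'Woman', 'Man']
```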
Tabulation with Labels
Create frequency tables using labeled values:
# Without labels
df.astsni2007.value_counts()
# With value labels
df.lab.astsni2007.value_counts()
# Crosstab with labels
pd.crosstab(df.lab.kon, df.lab.syssstatj)
Creating Labeled Plots
Use .lab with plotting libraries for automatically labeled visualizations:
import matplotlib.pyplot as plt
import seaborn as sns
# Aggregate data by year and employment status
agg_df = df.groupby(["examar", "syssstatj"], as_index=False).agg({"inkpens": "mean"})
# Create a plot with labeled variables
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(data=agg_df.lab, x='examar', y='inkpens', hue='syssstatj', palette="viridis")
# Style adjustments
plt.title('Pension Income by Year and Employment Status')
plt.xlabel('Year')
plt.ylabel('Average Pension Income')
plt.xticks(rotation=45)
plt.show()
Regression Analysis with Labels
Extract labeled variables for statistical modeling:
import statsmodels.api as sm
# Extract labeled variables
income = df.lab['dispinkfam04'] # Numeric variable
industry = df.lab['astsni2007'] # Categorical variable
outcome = df.lab['dispink04'] # Target variable
# Convert categorical variable into one-hot encoding
X = pd.get_dummies(industry, drop_first=True, dtype=float)
X[income.name] = income.astype(float)
# Add intercept
X = sm.add_constant(X)
# Run the regression
model = sm.OLS(outcome.astype(float), X).fit()
print(model.summary())
API Reference
autolabel()
Apply variable or value labels to a DataFrame.
df.autolabel(label_type='variables', domain='scb', lang='eng', force=False)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| label_type | str | Type of labels: 'variables' or 'values' (default: 'variables') |
| domain | str | Dataset domain (currently: 'scb') |
| lang | str | Language: 'eng' (English) or 'swe' (Swedish) |
| force | bool | Force re-download of metadata (default: False) |
.lab Accessor
Access labeled versions of your DataFrame or Series.
# Access labeled DataFrame
df.lab.head()
# Access labeled column
df.lab.astsni2007
# Show value labels
df.lab.show_values().head()
meta_search()
Search for variables containing a specific keyword in their name or label.
# Search for variables related to "industry"
df.meta_search("industry")
# Search for "income" variables
df.meta_search("income")
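Conceptually, a search over "name or label" is a substring match against two fields. The sketch below shows that idea on a plain dictionary; the function `search_labels` is hypothetical, the case-insensitive matching is an assumption, and the return format of the real meta_search() may differ.

```python
from typing import Dict


def search_labels(labels: Dict[str, str], keyword: str) -> Dict[str, str]:
    """Return entries whose variable name or label contains `keyword` (case-insensitive)."""
    needle = keyword.casefold()
    return {
        name: label
        for name, label in labels.items()
        if needle in name.casefold() or needle in label.casefold()
    }


# Illustrative labels; only the first one appears elsewhere in this guide.
example_labels = {
    "astsni2007": "Industry classification (SNI 2007)",
    "dispink04": "Disposable income",
    "kon": "Sex",
}

print(search_labels(example_labels, "industry"))  # matches on the label
print(search_labels(example_labels, "INK"))       # matches on the variable name
```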
Variable Label Methods
Get and set variable labels programmatically:
# Get a single variable label
df.get_variable_labels('astsni2007')
# Get multiple variable labels
df.get_variable_labels(['astsni2007', 'ssyk3'])
# Set a variable label
df.set_variable_labels('astsni2007', "Industry classification (SNI 2007)")
Value Label Methods
Get and set value labels for categorical variables:
# Get value labels for a variable
labels = df.get_value_labels('astsni2007')
# Update value labels
updates = {'00000': 'custom label 1', '01110': 'custom label 2'}
df.set_value_labels('astsni2007', updates)
lookup()
Look up variable metadata from the RegiStream domain without loading data:
from registream import lookup
# Lookup a variable in the SCB domain
lookup('astsni2007', domain='scb', lang='eng')
Configuration
Data Storage Locations
By default, RegiStream stores metadata files in your system's user home directory:
- macOS/Linux: ~/.registream/autolabel_keys/
- Windows: %USERPROFILE%\AppData\Local\registream\autolabel_keys\
This default location enables seamless sharing of label data across projects and programming languages (Stata, Python, R).
Secure Environments (MONA, etc.)
If you are working on a secure system that does not have a standard user home directory, you need to specify a custom directory for RegiStream to store files.
Option 1: Set in Your Script
import os
os.environ['REGISTREAM_DIR'] = "/path/to/custom/directory"
# Then import and use registream
import registream
Option 2: Set System-Wide Environment Variable
Linux/macOS:
Add to your .bashrc, .zshrc, or equivalent shell configuration file:
export REGISTREAM_DIR="/path/to/custom/directory"
Then reload your shell configuration: source ~/.bashrc or restart your terminal.
Windows:
Set via System Properties → Advanced → Environment Variables, or use Command Prompt (as Administrator):
setx REGISTREAM_DIR "C:\path\to\custom\directory"
Restart any command prompts or applications for the change to take effect.
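To confirm which folder your Python session would point at, you can mirror the documented locations in a few lines. This is a sketch of the rules described in this guide, not RegiStream's actual lookup code: the helper `resolve_metadata_dir` is hypothetical, and it assumes that a custom directory always contains a registream/autolabel_keys/ subfolder.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_metadata_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the metadata folder implied by the documented locations."""
    env = os.environ if env is None else env
    custom = env.get("REGISTREAM_DIR")
    if custom:
        # Custom layout described for secure environments.
        return Path(custom) / "registream" / "autolabel_keys"
    if os.name == "nt":
        # Documented Windows default under %USERPROFILE%\AppData\Local.
        return Path.home() / "AppData" / "Local" / "registream" / "autolabel_keys"
    # Documented macOS/Linux default.
    return Path.home() / ".registream" / "autolabel_keys"


print(resolve_metadata_dir())
```

If the printed path is not the one you expect, the environment variable is probably not visible to the process running Python (for example because the terminal or IDE was not restarted).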
Setting Up Data Files in Secure Environments
After setting your custom directory, create the required folder structure:
# Create directories (Linux/macOS; -p also creates the parent registream/ folder)
mkdir -p /path/to/custom/directory/registream/autolabel_keys/
Download and extract metadata files into the appropriate subdirectories:
- Download the ZIP files (e.g., scb_variables_eng.zip)
- Extract to get folders like scb_variables_eng/
- Move the CSV files into /path/to/custom/directory/registream/autolabel_keys/scb_variables_eng/
Important: Folder names must match exactly (e.g., scb_variables_eng).
RegiStream will automatically recognize and use them when you run your code.
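The extract-and-move steps above can be scripted with the standard library. The following is a sketch under stated assumptions: the helper `install_metadata_zip` is hypothetical, the target folder is named after the ZIP file (scb_variables_eng.zip becomes scb_variables_eng/), and the CSV files may sit either at the archive root or inside a subfolder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import List


def install_metadata_zip(zip_path: Path, base_dir: Path) -> List[Path]:
    """Copy every CSV in `zip_path` into <base_dir>/registream/autolabel_keys/<zip stem>/."""
    target = base_dir / "registream" / "autolabel_keys" / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(scratch)
        # CSVs may sit at the archive root or inside a subfolder; handle both.
        for csv_file in sorted(Path(scratch).rglob("*.csv")):
            destination = target / csv_file.name
            shutil.copy2(csv_file, destination)
            copied.append(destination)
    return copied
```

A call such as `install_metadata_zip(Path("scb_variables_eng.zip"), Path("/path/to/custom/directory"))` would then return the list of CSV paths it created, which is a convenient thing to log when setting up a secure server.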
Troubleshooting
Import Error: No module named 'registream'
Make sure RegiStream is installed in your current Python environment:
pip install registream
Labels Not Appearing
Common causes:
- Metadata not downloaded: First run of autolabel() will download metadata automatically if you have internet
- Variable names don't match: Check that your DataFrame column names match the metadata (case-insensitive)
- Wrong domain/language: Verify you're using the correct domain and lang parameters
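To narrow down a name mismatch, it can help to list the columns that have no counterpart in the metadata. The sketch below does a case-insensitive comparison on plain lists; the helper `unmatched_columns` and the example names are hypothetical, and you would need to supply the metadata variable names yourself (for example by reading them from the metadata CSV).

```python
from typing import Iterable, List


def unmatched_columns(columns: Iterable[str], metadata_names: Iterable[str]) -> List[str]:
    """Return the columns that have no case-insensitive match among `metadata_names`."""
    known = {name.casefold() for name in metadata_names}
    return [column for column in columns if column.casefold() not in known]


# Hypothetical inputs for illustration.
columns = ["ASTSNI2007", "ssyk3", "my_custom_var"]
metadata_names = ["astsni2007", "ssyk3", "kon"]
print(unmatched_columns(columns, metadata_names))  # ['my_custom_var']
```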
File Not Found Errors in Secure Environments
Ensure you've set the REGISTREAM_DIR environment variable correctly and that metadata files are in the right location.
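A quick way to check the layout is to test each requirement in turn and report the first one that fails. This is a diagnostic sketch, not part of RegiStream: the helper `diagnose_metadata_folder` is hypothetical and assumes the registream/autolabel_keys/<folder>/ structure described in the Configuration section.

```python
import os
from pathlib import Path
from typing import List, Optional


def diagnose_metadata_folder(folder_name: str, base_dir: Optional[str] = None) -> List[str]:
    """Return human-readable problems; an empty list means the layout looks fine."""
    base_dir = base_dir if base_dir is not None else os.environ.get("REGISTREAM_DIR")
    if not base_dir:
        return ["REGISTREAM_DIR is not set"]

    folder = Path(base_dir) / "registream" / "autolabel_keys" / folder_name
    if not folder.is_dir():
        return [f"missing folder: {folder}"]
    if not any(folder.glob("*.csv")):
        return [f"no CSV files in: {folder}"]
    return []


# Example call with an explicit base directory (placeholder path).
print(diagnose_metadata_folder("scb_variables_eng", base_dir="/path/that/does/not/exist"))
```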
Getting Help
For more help:
Additional Resources
For dataset-specific documentation, see the Datasets Guide. To create custom metadata, check out the Custom Datasets Guide.