Stata reference
Complete command reference for datamirror in Stata — workflow, commands, the four-layer architecture, supported estimation families, and privacy controls.
Install
Requires Stata 16.0 or later.
net install registream, from("https://registream.org/install/stata/latest") replace
net install registream installs the core package plus every module (autolabel and datamirror). If you only want datamirror, install just that module: net install datamirror pulls in core automatically as a dependency. For full install notes and the first-run wizard, see the install guide.
Quick start
* 1. Load your sensitive data
use "my_confidential_data.dta", clear
* 2. Initialize a datamirror session
datamirror init, checkpoint_dir("output") replace
* 3. Run your key analyses and checkpoint the results
reg employed age female
datamirror checkpoint, tag("employment_model")
reg wellbeing age i.education female
datamirror checkpoint, tag("wellbeing_model")
* 4. Extract metadata (distributions + checkpoint targets)
datamirror extract, replace
* 5. Generate synthetic data from the metadata
datamirror rebuild using "output", clear seed(12345)
* 6. Validate fidelity
datamirror check using "output"
* ✓ employment_model: Max Δβ/SE = 0.02
* ✓ wellbeing_model: Max Δβ/SE = 0.05
The synthetic dataset reproduces each checkpointed coefficient to within Δβ/SE < 3, which is inside the Bonferroni-corrected 99% simultaneous CI. You can re-run the same analysis script outside the secure environment and obtain estimates that agree with the originals within that tolerance.
Workflow
The standard datamirror flow has two phases running in different environments:
┌─────────────────────────────────────────────┐
│ EXTRACT PHASE (inside the secure env) │
│ │
│ Original Data │
│ ↓ │
│ datamirror init + checkpoint + extract │
│ ↓ │
│ Metadata files: │
│ • schema.csv │
│ • marginals_cont.csv │
│ • marginals_cat.csv │
│ • correlations.csv │
│ • checkpoints.csv │
│ • checkpoints_coef.csv │
│ • manifest.csv │
└─────────────────────────────────────────────┘
↓ transfer (small files)
┌─────────────────────────────────────────────┐
│ REBUILD PHASE (outside the secure env) │
│ │
│ datamirror rebuild using │
│ ↓ │
│ Synthetic dataset: │
│ • Marginals match (KS p > 0.05) │
│ • Correlations match (r > 0.95) │
│ • Stratification preserved │
│ • Checkpoint β match (Δβ/SE < 3) ⭐ │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ VALIDATION (anywhere) │
│ │
│ datamirror check using │
│ ↓ │
│ Fidelity report │
└─────────────────────────────────────────────┘
The metadata files (schema, marginals, correlations, checkpoints) are small plain-text CSVs that respect statistical disclosure control: small cells are suppressed automatically. They can be reviewed by a data owner before export and moved out of the secure environment through whatever channel the agency allows.
Command — datamirror init
Initialize a datamirror session and specify the output directory.
datamirror init, checkpoint_dir(string) [strata(varname) replace]
Options
- checkpoint_dir(string): directory where metadata files will be written. Required.
- strata(varname): stratification variable (optional). When specified, all properties are preserved within strata, which is useful for panel data and treatment/control groups.
- replace: overwrite any existing contents of checkpoint_dir.
datamirror init, checkpoint_dir("output") strata(wave) replace
Command — datamirror checkpoint
Save the current model's results as a replication target.
datamirror checkpoint, tag(string) [notes(string)]
Requirements
- Must be called immediately after an estimation command (regress, reghdfe, logit, …).
- Results are read from Stata's e() stored results.
Options
- tag(string): unique identifier for this checkpoint (required).
- notes(string): optional human-readable description.
Supported estimation commands
- regress: OLS
- reghdfe: fixed effects
- ivregress 2sls: 2SLS (single-checkpoint and joint across shared-outcome groups)
- logit, probit: binary outcomes
- poisson, nbreg: count outcomes
Unsupported commands (xtreg, ologit, mlogit, stcox, tobit, ivreghdfe, ...) cause datamirror checkpoint to exit cleanly with rc=199. A subset of them is on the v1.1 roadmap.
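Because unsupported estimators return rc=199, a batch script can trap that code and carry on with the remaining checkpoints. The following is a minimal sketch, assuming the return code surfaces through Stata's capture like any other error code (the source does not show this pattern). The dataset, model, and tag name are placeholders.

```stata
* Illustrative only: auto.dta, the tobit model and the tag are placeholders.
sysuse auto, clear
datamirror init, checkpoint_dir("output") replace

* tobit is listed above as an unsupported command
tobit mpg weight, ll(17)

* capture traps the return code instead of aborting the do-file
capture noisily datamirror checkpoint, tag("tobit_model")
local rc = _rc

if `rc' == 199 {
    display as text "tobit_model: estimator not supported, checkpoint skipped"
}
else if `rc' != 0 {
    exit `rc'    // any other error is still treated as fatal
}
```

Only rc=199 is swallowed here, so genuine failures (a missing session, a duplicate tag) still stop the script.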
reg employed age female if !missing(employed, age, female)
datamirror checkpoint, tag("model1") notes("Employment regression")
reghdfe wellbeing age employed, absorb(id wave)
datamirror checkpoint, tag("fe_model") notes("FE with slopes")
Command — datamirror extract
Extract all metadata from the currently-loaded dataset.
datamirror extract [, replace]
Output files (written to checkpoint_dir)
- metadata.csv: session info
- schema.csv: variable types, labels, and integer-detection flags
- marginals_cont.csv: continuous-variable quantile distributions
- marginals_cat.csv: categorical-variable frequency tables
- correlations.csv: correlation matrix
- checkpoints.csv: checkpoint registry (one row per tagged regression)
- checkpoints_coef.csv: target coefficients (one row per coefficient; cp_num foreign key)
- manifest.csv: self-describing file listing with row counts
Command — datamirror rebuild
Generate synthetic data from the extracted metadata.
datamirror rebuild using path [, clear seed(#)]
Options
- clear: replace the currently-loaded dataset.
- seed(#): random seed for reproducibility.
What happens
- Load the on-disk metadata from path (schema, marginals, correlations, checkpoints).
- Draw a Gaussian copula sample at the seeded random state.
- Transform each dimension to its Layer 1 marginal; apply within strata if Layer 3 was used.
- Layer 4: pin the checkpointed coefficients. Linear estimators use a closed-form Newton step; shared-outcome IV groups use a joint stacked Newton step; GLMs redraw y from the canonical DGP at the target coefficients.
- Return a synthetic dataset whose regressions reproduce the checkpointed estimates within Δβ/SE < 3.
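The copula draw and the marginal transform can be sketched with built-in Stata commands. This is a minimal illustration, not datamirror's implementation. The latent correlation of 0.6 and the two target marginals (an Exp(1) variable and a Bernoulli(0.3) indicator) are made-up stand-ins for what would be read from correlations.csv and the marginals files.

```stata
clear
set seed 12345
set obs 1000

* Step 1: correlated standard normals (the Gaussian copula draw)
matrix C = (1, .6 \ .6, 1)          // assumed latent correlation
drawnorm z1 z2, corr(C)

* Step 2: map each normal to a uniform on (0,1)
gen double u1 = normal(z1)
gen double u2 = normal(z2)

* Step 3: push each uniform through the inverse CDF of its target marginal
gen double x1 = -ln(1 - u1)         // Exp(1), analytic inverse CDF
gen byte   x2 = u2 > 0.7            // Bernoulli with P(x2 = 1) = 0.3

* The dependence survives the monotone transforms
correlate x1 x2
```

With an empirical 101-point quantile table in place of the analytic inverse CDF, step 3 would become an interpolation between adjacent quantiles.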
datamirror rebuild using "output", clear seed(12345)
Command — datamirror check
Validate synthetic-data fidelity against the original checkpoints.
datamirror check using path
Validations
- Marginal distributions: Kolmogorov–Smirnov tests per variable
- Correlations: Pearson correlations on the copula sample
- Checkpoint coefficients: target Δβ/SE < 3 (Bonferroni-corrected 99% simultaneous CI over the coefficients in each checkpoint)
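The first two validations can be reproduced by hand with Stata's built-in ksmirnov and correlate. The sketch below is an assumption about what such a check looks like, not the code datamirror check runs. It stacks two simulated samples and uses a hypothetical source flag to stand in for "original" versus "synthetic".

```stata
clear
set seed 12345
set obs 2000

* source = 0 plays the original data, source = 1 the synthetic data
gen byte   source = _n > 1000
gen double x = rnormal(50, 10)
gen double y = 0.5*x + rnormal(0, 5)

* Marginals: two-sample Kolmogorov-Smirnov test on x
ksmirnov x, by(source)
display "KS p-value for x: " %6.4f r(p)

* Correlations: compare Pearson r between the two samples
correlate x y if source == 0
local r0 = r(rho)
correlate x y if source == 1
local r1 = r(rho)
display "absolute difference in r: " %6.4f abs(`r0' - `r1')
```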
Output
──────────────────────────────────────────────────────────
Checkpoint 1: employment_model
──────────────────────────────────────────────────────────
Variable Original Synthetic Δβ Δβ/SE
───────────────────────────────────────────────────────
age -0.0064 -0.0064 0.0000 0.00
female -0.1128 -0.1128 0.0000 0.00
_cons 0.8973 0.8973 0.0000 0.00
───────────────────────────────────────────────────────
Max Δβ/SE = 0.00
✓ PASS (Δβ/SE < 3)
The four-layer architecture
datamirror preserves fidelity through four constraint layers plus schema metadata:
Schema (metadata)
- Variable types (numeric, categorical, string)
- Value labels
- Variable labels
- Storage formats
Layer 1 — Marginal distributions
- Continuous: 101-point quantile distributions (p0, p1, …, p100). Variables containing only integer values in the original data are auto-detected and rounded to integers in the synthetic data.
- Categorical: complete frequency tables for all values.
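As a rough illustration of what a Layer 1 summary contains, the sketch below computes the endpoints and percentiles of one variable and applies an integer check. It uses only built-in commands on the auto dataset. The exact quantile definition and detection rule datamirror uses are not stated in this page, so treat both as assumptions.

```stata
sysuse auto, clear

* p0 and p100 are the sample minimum and maximum
summarize price, meanonly
local p0   = r(min)
local p100 = r(max)

* _pctile with nquantiles(100) stores p1..p99 in r(r1)..r(r99)
_pctile price, nquantiles(100)
local p50 = r(r50)

* Integer detection: no nonmissing value has a fractional part
count if price != floor(price) & !missing(price)
local is_integer = (r(N) == 0)

display "p0 = `p0'  p50 = `p50'  p100 = `p100'  is_integer = `is_integer'"
```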
Layer 2 — Correlation structure
- Gaussian copula captures pairwise dependencies.
- Preserves correlations with r > 0.95.
Layer 3 — Stratification
- All properties hold within strata (e.g. by year, region, wave, treatment group).
- Enables panel / longitudinal data synthesis with independent distributions per stratum.
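To make "within strata" concrete, the sketch below computes a few marginal summaries separately per stratum. It is illustrative only: foreign stands in for the strata() variable, and three quartiles stand in for the full quantile table.

```stata
sysuse auto, clear

* One row of marginal summaries per stratum
collapse (count) n = price ///
         (p25) p25 = price (p50) p50 = price (p75) p75 = price, by(foreign)

list, noobs
```

Each stratum keeps its own distribution, so a synthetic draw for one group never borrows quantiles from another.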
Layer 4 — Checkpoint constraints ⭐
Two principled methods, one per family, with no learning rates or iteration knobs.
- Linear estimators (OLS, FE, 2SLS). β̂ is a linear function of y, so the exact shift that moves β̂ to the target β* is computable in closed form: delta_y = X · (β* − β̂), computed via Stata's matrix score. One step, no iteration.
- Shared-outcome 2SLS groups. When multiple IV specifications share an outcome (e.g. main-shock and gender-shock specs on the same dependent variable), a joint stacked Newton step pins every coefficient constraint simultaneously.
- Generalized linear models (logit, probit, poisson, nbreg). y is resampled from the canonical data-generating process at the target coefficients. Factor-level coefficients are preserved by construction.
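The closed-form shift for the plain OLS case can be checked directly with built-in commands. The sketch below is a minimal illustration, not datamirror's code. The target β* is a made-up perturbation of the fitted coefficients, and bhat, bstar, d, dy and price_adj are hypothetical names.

```stata
sysuse auto, clear

* Fit on the current outcome and keep beta-hat
regress price mpg weight
matrix bhat = e(b)

* Hypothetical target: move the mpg coefficient by +10, leave the rest
matrix bstar = bhat
matrix bstar[1,1] = bhat[1,1] + 10

* delta_y = X * (bstar - bhat), evaluated row by row with matrix score
matrix d = bstar - bhat
local names : colfullnames bhat
matrix colnames d = `names'         // matrix score matches on column names
matrix score double dy = d

* Shift the outcome and refit: OLS now returns bstar
gen double price_adj = price + dy
regress price_adj mpg weight
```

This works because OLS is linear in y: adding X·δ to the outcome adds exactly δ to the coefficient vector, while the residuals, and therefore the standard errors, are unchanged.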
Fidelity target: Δβ/SE < 3 (Bonferroni-corrected 99% simultaneous CI over the coefficients in each checkpoint). Validated on four AEA replication packages: 390+ coefficient comparisons, none exceed the threshold.
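For orientation, the sketch below shows how a two-sided Bonferroni critical value at a joint 99% level varies with the number of coefficients k in a checkpoint. It assumes a normal approximation and the standard construction z at 1 − 0.01/(2k). The source does not spell out the calculation behind the threshold of 3, so this is one plausible reading only.

```stata
* Two-sided Bonferroni critical value for k coefficients, joint level 99%
forvalues k = 1/8 {
    display "k = `k'   critical |z| = " %5.3f invnormal(1 - 0.01/(2*`k'))
}
```

Under this reading the critical value is roughly 3.02 for k = 4 and grows slowly as k increases.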
When a regression involves a rare binary variable, datamirror extract emits a warning. The regression will still run, but Δβ/SE may be elevated for coefficients involving the rare binary. A v1.1 copula refinement is planned.
Supported estimation families (Layer 4)
The fidelity target for every command is Δβ/SE < 3. Unsupported commands exit datamirror checkpoint cleanly with a pointer to this page.
| Family | Commands | Method |
|---|---|---|
| Linear | regress, reghdfe | Closed-form Newton via matrix score |
| 2SLS | ivregress 2sls (single + joint) | Weighted-FWL Newton; joint stack for shared outcomes |
| Binary | logit, probit | Direct Bernoulli DGP at target coefficients |
| Count | poisson, nbreg | Direct Poisson / Gamma-Poisson DGP at target coefficients |
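The "direct DGP" rows of the table can be illustrated by drawing an outcome from the model implied by a set of target coefficients and refitting. The sketch below is a simplified stand-in, not datamirror's implementation. The predictor is simulated, and the targets b0 and b1 are hypothetical values.

```stata
clear
set seed 12345
set obs 5000
gen double x = rnormal()

* Hypothetical target coefficients
local b0 = -0.5
local b1 =  0.8

* Binary family: Bernoulli draw with logit link at the targets
gen byte y_logit = runiform() < invlogit(`b0' + `b1'*x)

* Count family: Poisson draw with log link at the targets
gen long y_pois = rpoisson(exp(`b0' + `b1'*x))

* Refit: estimates land near the targets, up to sampling noise
logit y_logit x
poisson y_pois x
```

Unlike the closed-form linear shift, the refitted coefficients here match the targets only up to sampling error, which is why the fidelity target is stated in units of the standard error.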
Privacy & statistical disclosure control
datamirror implements Statistical Disclosure Control (SDC) to ensure that checkpoint metadata cannot be used to identify small groups or individuals. All privacy controls are governed by a single global parameter, DM_MIN_CELL_SIZE.
Default: 50 — the strictest threshold of any major national statistical agency. Groups or strata with fewer than 50 observations are automatically suppressed from the checkpoint metadata.
| Threshold | Use case | Agencies |
|---|---|---|
| 50 | Maximum safety — multi-agency compliance | All (strictest) |
| 20 | Standard microdata release | Statistics Sweden, Eurostat |
| 10 | Internal use within secure environment | US Census Bureau, UK ONS, Eurostat (tables) |
| 5 | Not recommended | Below most microdata standards |
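The effect of a minimum cell size on a frequency table can be sketched with built-in commands. This is an illustration of the general SDC idea, not datamirror's suppression code. The threshold of 10 is chosen only so the auto dataset shows both suppressed and retained cells.

```stata
sysuse auto, clear
local min_cell = 10                 // illustrative stand-in for DM_MIN_CELL_SIZE

* One row per category with its frequency (missing is kept as a category)
contract rep78, freq(n)

* Flag and blank out cells below the threshold
gen byte suppressed = n < `min_cell'
replace n = . if suppressed

list, noobs
```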
Change by setting the global before datamirror extract:
global DM_MIN_CELL_SIZE = 20
datamirror extract, replace
Limitations
Scope: cross-sectional microdata
datamirror v1 targets cross-sectional economic microdata, which represents the majority of Stata-based empirical workflows.
| Design | Support | Notes |
|---|---|---|
| Cross-sectional RCT | ✓ Full | Primary use case |
| Balance checks | ✓ Full | Supported via standard OLS / FE checkpoints |
| OLS / FE regression | ✓ Full | Supported |
| Treatment effects | ✓ Full | With stratification for interactions |
| IV (distinct instruments) | ✓ Full | Supported |
| IV (correlated instruments) | ∼ Partial | Known limitation — methodological work in progress |
| Panel data (within-unit time-series) | ✗ | Requires fundamentally different dependence structure |
| Event studies / DiD | ✗ | Requires pre/post correlation within units |
Panel data support requires a different generative architecture (temporal copulas, hierarchical models, state-space approaches) and is a separate methodological direction for future work.
Marginal drift on the outcome
Layer 4 either shifts the outcome y (linear family) or resamples it (GLM family) to pin the target coefficients. Predictor marginals are preserved intact; the outcome marginal drifts by O(1/√N) as a consequence.
- Predictors (age, education, income, ...) are preserved.
- Outcomes may drift slightly in mean or SD as a result of the coefficient-matching step.
Rationale: predictors are used across many analyses; outcomes usually belong to one model each, and pinning coefficients is the primary claim. If you report a summary-statistics table on the outcome variable, add a caveat noting that the mean and SD reflect the synthetic generation rather than the original.
Output files
After datamirror extract, the checkpoint_dir contains:
| File | Contents |
|---|---|
| metadata.csv | Session info, version, timestamp |
| schema.csv | varname, type, format, storage, is_integer |
| marginals_cont.csv | 101-point quantile distributions for continuous variables |
| marginals_cat.csv | Frequency tables for categorical variables |
| correlations.csv | Pairwise correlation matrix |
| checkpoints.csv | Checkpoint registry (tag, notes, cmd, model, …) |
| checkpoints_coef.csv | Target coefficients, long format with cp_num foreign key |
| manifest.csv | Self-describing listing of every file with row counts and descriptions |
All files are plain CSV. They can be reviewed by a data-owner before export and moved out of a secure environment through any standard file-transfer channel.
See also
- datamirror overview: positioning and the four-layer architecture at a glance
- Fidelity tiers: the four-layer invariant ladder
- Install guide: including secure-environment setup
- Citation (or run datamirror cite for a version-pinned block)