Stata reference
Complete command reference for datamirror in Stata — workflow, commands, the four-layer architecture, supported estimation families, and privacy controls.
Install
Requires Stata 16.0 or later. Install the registream core first, then datamirror:
net install registream, from("https://registream.org/install/stata/registream/latest") replace
net install datamirror, from("https://registream.org/install/stata/datamirror/latest") replace
datamirror requires registream (the shared core: configuration, first-run wizard, update management) and prints install instructions at runtime if it's missing. Full install notes and first-run wizard: install guide.
Quick start
* 1. Load your sensitive data
use "my_confidential_data.dta", clear
* 2. Initialize a datamirror session
datamirror init, checkpoint_dir("output") replace
* 3. Run your key analyses and checkpoint the results
reg employed age female
datamirror checkpoint, tag("employment_model")
reg wellbeing age i.education female
datamirror checkpoint, tag("wellbeing_model")
* 4. Extract metadata (distributions + checkpoint targets)
datamirror extract, replace
* 5. Generate synthetic data from the metadata
datamirror rebuild using "output", clear seed(12345)
* 6. Validate fidelity
datamirror check using "output"
* ✓ employment_model: Max Δβ/SE = 0.02
* ✓ wellbeing_model: Max Δβ/SE = 0.05
The synthetic dataset reproduces the original coefficients within Δβ/SE < 3, a fixed diagnostic tolerance that corresponds to the 99.7% band for a single coefficient. You can re-run the same analysis script outside the secure environment and obtain estimates that match the originals within that tolerance.
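As a minimal sketch of that last step, assuming the metadata bundle from the steps above has been transferred to a local directory named output, the Quick start regressions can be re-run on the rebuilt data using only commands shown on this page:

```stata
* Outside the secure environment: rebuild, then re-run the original script
datamirror rebuild using "output", clear seed(12345)

* Same specifications as in the Quick start
reg employed age female
reg wellbeing age i.education female
```

The coefficients printed here are the ones that datamirror check compares against the checkpointed targets.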
Workflow
The standard datamirror flow has two phases running in different environments:
┌─────────────────────────────────────────────┐
│  EXTRACT PHASE (inside the secure env)      │
│                                             │
│  Original Data                              │
│        ↓                                    │
│  datamirror init + checkpoint + extract     │
│        ↓                                    │
│  Metadata files:                            │
│    • schema.csv                             │
│    • marginals_cont.csv                     │
│    • marginals_cat.csv                      │
│    • correlations.csv                       │
│    • checkpoints.csv                        │
│    • checkpoints_coef.csv                   │
│    • manifest.csv                           │
└─────────────────────────────────────────────┘
                      ↓ transfer (small files)
┌─────────────────────────────────────────────┐
│  REBUILD PHASE (outside the secure env)     │
│                                             │
│  datamirror rebuild using                   │
│        ↓                                    │
│  Synthetic dataset:                         │
│    • Marginals match (KS p > 0.05)          │
│    • Correlations match (r > 0.95)          │
│    • Stratification preserved               │
│    • Checkpoint β match (Δβ/SE < 3) ⭐      │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│  VALIDATION (anywhere)                      │
│                                             │
│  datamirror check using                     │
│        ↓                                    │
│  Fidelity report                            │
└─────────────────────────────────────────────┘
The metadata files (schema, marginals, correlations, checkpoints) are small plain-text CSVs that respect statistical disclosure control: small cells are suppressed automatically. They can be reviewed by a data owner before export and moved out of the secure environment through whatever channel the agency allows.
Command — datamirror init
Initialize a datamirror session and specify the output directory.
datamirror init, checkpoint_dir(string) [strata(varname) replace]
Options
- checkpoint_dir(string): directory where metadata files will be written. Required.
- strata(varname): stratification variable (optional). When specified, all properties are preserved within strata, which is useful for panel data and treatment/control groups.
- replace: overwrite any existing contents of checkpoint_dir.
datamirror init, checkpoint_dir("output") strata(wave) replace
Command — datamirror checkpoint
Save the current model's results as a replication target.
datamirror checkpoint, tag(string) [notes(string)]
Requirements
- Must be called immediately after an estimation command (regress, reghdfe, logit, …).
- Results are read from Stata's e() stored results.
Options
- tag(string): unique identifier for this checkpoint (required).
- notes(string): optional human-readable description.
Supported estimation commands
- regress: OLS
- reghdfe: fixed effects
- ivregress 2sls: 2SLS (single-checkpoint and joint across shared-outcome groups)
- logit, logistic, probit: binary outcomes
- poisson, nbreg: count outcomes
Unsupported commands (xtreg, ologit, mlogit, stcox, tobit, ivreghdfe, ...) exit datamirror checkpoint cleanly with rc=199. A subset is on the v1.1 roadmap.
reg employed age female if !missing(employed, age, female)
datamirror checkpoint, tag("model1") notes("Employment regression")
reghdfe wellbeing age employed, absorb(id wave)
datamirror checkpoint, tag("fe_model") notes("FE with slopes")
Command — datamirror extract
Extract all metadata from the currently-loaded dataset.
datamirror extract [, replace]
Output files (written to checkpoint_dir)
- metadata.csv: session info
- schema.csv: variable types, labels, and integer-detection flags
- marginals_cont.csv: continuous-variable quantile distributions
- marginals_cat.csv: categorical-variable frequency tables
- correlations.csv: correlation matrix
- checkpoints.csv: checkpoint registry (one row per tagged regression)
- checkpoints_coef.csv: target coefficients (one row per coefficient; cp_num foreign key)
- manifest.csv: self-describing file listing with row counts
Command — datamirror rebuild
Generate synthetic data from the extracted metadata.
datamirror rebuild using path [, clear seed(#) n(#) scale(#) verify]
Options
- clear: replace the currently-loaded dataset.
- seed(#): random seed for reproducibility.
- n(#): redraw at # observations instead of the bundle's recorded N; relationships and pinned coefficients are preserved at the requested size.
- scale(#): redraw at a multiple of the recorded N rather than an absolute count (scale(0.001) draws one-thousandth, scale(2) draws double). Mutually exclusive with n(#); both resolve to the same target size.
- verify: after rebuild, re-run every checkpoint and report Δβ/SE.
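The size options can be combined with the others as in the following sketch, which assumes a metadata bundle already exists in a directory named output (as in the Quick start) and uses only the options listed above:

```stata
* Redraw at an absolute size of 5,000 observations
datamirror rebuild using "output", clear seed(12345) n(5000)

* Redraw at double the recorded N, then re-run every checkpoint
* and report the fidelity ratio for each coefficient
datamirror rebuild using "output", clear seed(12345) scale(2) verify
```

Because n(#) and scale(#) are mutually exclusive, each call specifies only one of them.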
What happens
- Load the on-disk metadata from path (schema, marginals, correlations, checkpoints).
- Draw a Gaussian copula sample at the seeded random state.
- Transform each dimension to its Layer 1 marginal; apply within strata if Layer 3 was used.
- Layer 4: pin the checkpointed coefficients. Linear estimators use a closed-form Newton step; shared-outcome IV groups use a joint stacked Newton step; GLMs redraw y from the canonical DGP at the target coefficients.
- Return a synthetic dataset whose regressions reproduce the checkpointed estimates within Δβ/SE < 3.
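The copula draw and the marginal transform in the steps above can be illustrated with a small, self-contained sketch. It is not datamirror's internal code: the correlation of 0.6, the 11-point quantile grid (the real bundle stores 101 points), the 0.52 share, and the variable names are all hypothetical values chosen for illustration.

```stata
clear
set seed 12345
set obs 1000

* Latent Gaussian copula draw with a hypothetical correlation of 0.6
matrix C = (1, 0.6 \ 0.6, 1)
drawnorm z1 z2, corr(C) double

* Map the latent normals to uniforms on (0, 1)
generate double u1 = normal(z1)
generate double u2 = normal(z2)

* Continuous marginal: linear interpolation on a quantile grid
* (hypothetical deciles p0, p10, ..., p100)
matrix Q = (18 \ 22 \ 27 \ 31 \ 36 \ 40 \ 45 \ 51 \ 58 \ 66 \ 80)
generate double pos = 10*u1
generate int lo = min(floor(pos), 9)
generate double age = Q[lo + 1, 1] + (pos - lo)*(Q[lo + 2, 1] - Q[lo + 1, 1])

* Binary marginal with a hypothetical share of 0.52
generate byte female = u2 > 0.48

summarize age female
correlate age female
```

Each transform is monotone in its latent normal, so rank dependence from the copula carries over to the synthetic variables while each marginal follows its own table.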
datamirror rebuild using "output", clear seed(12345)
Command — datamirror check
Validate synthetic-data fidelity against the original checkpoints.
datamirror check using path [, saving(filename)]
saving(filename) exports the per-coefficient fidelity table to CSV (used by the replication verify.do scripts).
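A minimal sketch of using the exported table follows. It assumes a bundle in a directory named output and a hypothetical file name fidelity.csv; the column layout of the exported file is not documented here, so the sketch only inspects it with describe and list.

```stata
* Run the check and export the per-coefficient table (file name is hypothetical)
datamirror check using "output", saving("fidelity.csv")

* Inspect the exported table with standard Stata commands
import delimited using "fidelity.csv", clear varnames(1)
describe
list in 1/5
```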
Validations
- Marginal distributions: Kolmogorov–Smirnov tests per variable
- Correlations: Spearman rank correlations on the copula sample
- Checkpoint coefficients: target Δβ/SE < 3 (a fixed diagnostic tolerance; 3 is the 99.7% single-coefficient band)
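The coefficient criterion can be worked through by hand. The sketch below assumes the ratio is the absolute coefficient difference divided by the original model's standard error, which is an interpretation rather than something this page states. The standard error and the synthetic estimate are hypothetical numbers.

```stata
* Original estimate taken from the sample output; the other two are hypothetical
scalar b_orig  = -0.1128
scalar se_orig =  0.0150
scalar b_syn   = -0.1101

* Fidelity ratio for one coefficient
scalar ratio = abs(b_syn - b_orig)/se_orig
display "delta-beta/SE = " %5.2f ratio
display "pass (ratio < 3) = " (ratio < 3)

* Two-sided standard-normal coverage of a 3-SE band, about 0.9973
display %6.4f (2*normal(3) - 1)
```

The last line shows where the 99.7% figure comes from: it is the probability that a standard normal variable falls within three standard deviations of zero.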
Output
──────────────────────────────────────────────────────────
Checkpoint 1: employment_model
──────────────────────────────────────────────────────────
Variable Original Synthetic Δβ Δβ/SE
───────────────────────────────────────────────────────
age -0.0064 -0.0064 0.0000 0.00
female -0.1128 -0.1128 0.0000 0.00
_cons 0.8973 0.8973 0.0000 0.00
───────────────────────────────────────────────────────
Max Δβ/SE = 0.00
✓ PASS (Δβ/SE < 3)
The four-layer architecture
datamirror preserves fidelity through four constraint layers plus schema metadata:
Schema (metadata)
- Variable types (numeric, categorical, string)
- Value labels
- Variable labels
- Storage formats
Layer 1 — Marginal distributions
- Continuous: 101-point quantile distributions (p0, p1, …, p100). Variables containing only integer values in the original data are auto-detected and rounded to integers in the synthetic data.
- Categorical: complete frequency tables for all values.
Layer 2 — Correlation structure
- Gaussian copula captures pairwise dependencies.
- Preserves correlations with r > 0.95.
Layer 3 — Stratification
- All properties hold within strata (e.g. by year, region, wave, treatment group).
- Enables per-wave (cross-sectional) synthesis with independent distributions per stratum. Within-unit time-series dependence is not modeled (see Limitations).
Layer 4 — Checkpoint constraints ⭐
Two principled methods, one per family, with no learning rates or iteration knobs.
- Linear estimators (OLS, FE, 2SLS). β̂ is a linear functional of y, so the exact shift that moves β̂ to the target β* is computable in closed form: delta_y = X · (β* − β̂) via Stata's matrix score. One step, no iteration.
- Shared-outcome 2SLS groups. When multiple IV specifications share an outcome (e.g. main-shock and gender-shock specs on the same dependent variable), a joint stacked Newton step pins every coefficient constraint simultaneously.
- Generalized linear models (logit, probit, poisson, nbreg). y is resampled from the canonical data-generating process at the target coefficients. Factor-level coefficients are preserved by construction.
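For plain OLS the closed-form shift can be checked directly: regressing y + X(β* − β̂) on X gives (X'X)⁻¹X'y + (β* − β̂) = β*. The sketch below illustrates that algebra on simulated data. It is not datamirror's own implementation, and the data, variable names, and target vector bstar are hypothetical.

```stata
clear
set seed 12345
set obs 500
generate double x1 = rnormal()
generate double x2 = rnormal()
generate double y  = 1 + 0.5*x1 - 0.3*x2 + rnormal()

* Hypothetical target coefficients standing in for a checkpoint
matrix bstar = (0.8, -0.1, 1.2)
matrix colnames bstar = x1 x2 _cons

* Current estimates and the gap to the target
quietly regress y x1 x2
matrix bhat  = e(b)
matrix delta = bstar - bhat
matrix colnames delta = x1 x2 _cons

* delta_y = X * (bstar - bhat), evaluated observation by observation
matrix score double delta_y = delta
replace y = y + delta_y

* The re-estimated coefficients now equal the targets up to rounding
regress y x1 x2
```

The shift lies in the column space of X, so the OLS residuals and therefore the standard errors are unchanged; only the fitted part of y moves.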
Fidelity target: Δβ/SE < 3 (a fixed diagnostic tolerance; 3 is the 99.7% single-coefficient band). Validated on four AEA replication packages; the checkpointed regressions clear the threshold.
When a checkpointed regression involves a rare binary variable, datamirror extract emits a warning. The regression will still run, but Δβ/SE may be elevated for coefficients involving the rare binary. A v1.1 copula refinement is planned.
Supported estimation families (Layer 4)
The fidelity target for every command is Δβ/SE < 3. Unsupported commands exit datamirror checkpoint cleanly with a pointer to this page.
| Family | Commands | Method |
|---|---|---|
| Linear | regress, reghdfe | Closed-form Newton via matrix score |
| 2SLS | ivregress 2sls (single + joint) | Weighted-FWL Newton; joint stack for shared outcomes |
| Binary | logit, probit | Direct Bernoulli DGP at target coefficients |
| Count | poisson, nbreg | Direct Poisson / Gamma-Poisson DGP at target coefficients |
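The "direct DGP at target coefficients" idea for the binary and count families can be sketched as follows. This is an illustration under stated assumptions rather than datamirror's code: the predictors are simulated, and the target coefficients and variable names are hypothetical.

```stata
clear
set seed 12345
set obs 20000
generate double age    = rnormal(40, 10)
generate byte   female = runiform() < 0.5

* Hypothetical target coefficients
scalar b_age    = -0.02
scalar b_female =  0.40
scalar b_cons   =  0.50

* Linear index evaluated at the targets
generate double xb = b_cons + b_age*age + b_female*female

* Binary family: Bernoulli draw through the logit link
generate byte employed = runiform() < invlogit(xb)
logit employed age female
display "age: delta-beta/SE = " %5.2f abs(_b[age] - b_age)/_se[age]

* Count family: Poisson draw through the log link
generate long visits = rpoisson(exp(xb))
poisson visits age female
```

Because the outcome is drawn from the model at the target coefficients, the re-estimated coefficients differ from the targets only by sampling noise, which is what the Δβ/SE ratio measures.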
Privacy & statistical disclosure control
datamirror implements Statistical Disclosure Control (SDC) so that checkpoint metadata cannot be used to identify small groups or individuals. Three controls govern the output, each resolved the same way (init option > registream config > package default):
- min_cell_size (default 50): suppress categorical cells and strata with fewer observations than the threshold.
- quantile_trim (default 1): top/bottom-code continuous extremes so the synthetic support is [p01, p99], not the raw min/max.
- min_resid_df (default 10): minimum residual degrees of freedom for a regression checkpoint.
The default cell size of 50 is the strictest threshold of any major national statistical agency.
| Threshold | Use case | Agencies |
|---|---|---|
| 50 | Maximum safety — multi-agency compliance | All (strictest) |
| 20 | Standard microdata release | Statistics Sweden, Eurostat |
| 10 | Internal use within secure environment | US Census Bureau, UK ONS, Eurostat (tables) |
| 5 | Not recommended | Below most microdata standards |
Set a control for one extract via the init option, or persist it across sessions via the registream config:
datamirror init, checkpoint_dir("output") replace min_cell_size(20)
registream config, dm_min_cell_size(20) // persistent
Regression checkpoints are gated at capture time: a checkpoint is rejected if its sample N is below min_cell_size or its residual df is below min_resid_df, and any coefficient describing a sub-min_cell_size group is dropped while the rest of the regression is kept. The resolved thresholds and the suppression counts (including n_coef_suppressed) are recorded in metadata.csv. Full detail on the Privacy & SDC page.
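What the three controls mean in practice can be illustrated with standard Stata commands on the auto dataset that ships with Stata. This is a conceptual sketch, not datamirror's implementation; the threshold of 20 is chosen only so that the small example dataset shows both suppressed and retained cells.

```stata
sysuse auto, clear
local min_cell_size 20     // illustrative value; the package default is 50
local quantile_trim 1
local min_resid_df  10

* Cell suppression: flag categories with fewer than min_cell_size observations
bysort rep78: generate long cell_n = _N
generate byte suppressed = cell_n < `min_cell_size'
tabulate rep78 suppressed, missing

* Quantile trim: top/bottom-code price to the [p1, p99] range
local upper = 100 - `quantile_trim'
_pctile price, percentiles(`quantile_trim' `upper')
local lo = r(r1)
local hi = r(r2)
generate double price_trim = min(max(price, `lo'), `hi') if !missing(price)
summarize price price_trim

* Residual-df gate for a regression checkpoint
quietly regress price mpg weight
display "residual df = " e(df_r) ", passes gate = " (e(df_r) >= `min_resid_df')
```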
Limitations
Scope: cross-sectional microdata
datamirror v1 targets cross-sectional economic microdata, which represents the majority of Stata-based empirical workflows.
| Design | Support | Notes |
|---|---|---|
| Cross-sectional RCT | ✓ Full | Primary use case |
| Balance checks | ✓ Full | Supported via standard OLS / FE checkpoints |
| OLS / FE regression | ✓ Full | Supported |
| Treatment effects | ✓ Full | With stratification for interactions |
| IV (distinct instruments) | ✓ Full | Supported |
| IV (correlated instruments) | ∼ Partial | Known limitation — methodological work in progress |
| Panel data (within-unit time-series) | ✗ | Requires fundamentally different dependence structure |
| Event studies / DiD | ✗ | Requires pre/post correlation within units |
Panel data support requires a different generative architecture (temporal copulas, hierarchical models, state-space approaches) and is a separate methodological direction for future work.
Marginal drift on the outcome
Layer 4 either shifts the outcome y (linear family) or resamples it (GLM family) to pin the target coefficients. Predictor marginals are preserved intact; the outcome marginal drifts by O(1/√N) as a consequence.
- Predictors (age, education, income, ...) are preserved.
- Outcomes may drift slightly in mean or SD as a result of the coefficient-matching step.
Rationale: predictors are used across many analyses; outcomes usually belong to one model each, and pinning coefficients is the primary claim. If you report a summary-statistics table on the outcome variable, add a caveat noting that the mean and SD reflect the synthetic generation rather than the original.
Output files
After datamirror extract, the checkpoint_dir contains:
| File | Contents |
|---|---|
| metadata.csv | Session info, version, timestamp |
| schema.csv | varname, type, format, storage, is_integer |
| marginals_cont.csv | 101-point quantile distributions for continuous variables |
| marginals_cat.csv | Frequency tables for categorical variables |
| correlations.csv | Pairwise correlation matrix |
| checkpoints.csv | Checkpoint registry (tag, notes, cmd, model, …) |
| checkpoints_coef.csv | Target coefficients, long format with cp_num foreign key |
| manifest.csv | Self-describing listing of every file with row counts and descriptions |
All files are plain CSV. They can be reviewed by a data-owner before export and moved out of a secure environment through any standard file-transfer channel.
See also
- datamirror overview — positioning and the four-layer architecture at a glance
- Fidelity tiers — the four-layer invariant ladder
- Install guide — including secure-environment setup
- Citation, or run datamirror cite for a version-pinned block