datamirror · Stata

Stata reference

Complete command reference for datamirror in Stata — workflow, commands, the four-layer architecture, supported estimation families, and privacy controls.

Install

Requires Stata 16.0 or later. Install the registream core first, then datamirror:

net install registream, from("https://registream.org/install/stata/registream/latest") replace
net install datamirror, from("https://registream.org/install/stata/datamirror/latest") replace

datamirror requires registream (the shared core: configuration, first-run wizard, update management) and prints install instructions at runtime if it is missing. See the install guide for full install notes and the first-run wizard.

Quick start

* 1. Load your sensitive data
use "my_confidential_data.dta", clear

* 2. Initialize a datamirror session
datamirror init, checkpoint_dir("output") replace

* 3. Run your key analyses and checkpoint the results
reg employed age female
datamirror checkpoint, tag("employment_model")

reg wellbeing age i.education female
datamirror checkpoint, tag("wellbeing_model")

* 4. Extract metadata (distributions + checkpoint targets)
datamirror extract, replace

* 5. Generate synthetic data from the metadata
datamirror rebuild using "output", clear seed(12345)

* 6. Validate fidelity
datamirror check using "output"
*   ✓ employment_model: Max Δβ/SE = 0.02
*   ✓ wellbeing_model:  Max Δβ/SE = 0.05

The synthetic dataset reproduces the original coefficients within Δβ/SE < 3, the 99.7% single-coefficient band used as a fixed diagnostic tolerance. You can re-run the same analysis script outside the secure environment and recover the checkpointed estimates to within that tolerance.

Workflow

The standard datamirror flow has two phases running in different environments:

┌─────────────────────────────────────────────┐
│ EXTRACT PHASE  (inside the secure env)      │
│                                             │
│  Original Data                              │
│     ↓                                       │
│  datamirror init + checkpoint + extract     │
│     ↓                                       │
│  Metadata files:                            │
│    • metadata.csv                           │
│    • schema.csv                             │
│    • marginals_cont.csv                     │
│    • marginals_cat.csv                      │
│    • correlations.csv                       │
│    • checkpoints.csv                        │
│    • checkpoints_coef.csv                   │
│    • manifest.csv                           │
└─────────────────────────────────────────────┘
                     ↓  transfer (small files)
┌─────────────────────────────────────────────┐
│ REBUILD PHASE  (outside the secure env)     │
│                                             │
│  datamirror rebuild using <path>            │
│     ↓                                       │
│  Synthetic dataset:                         │
│    • Marginals match (KS p > 0.05)          │
│    • Correlations match (r > 0.95)          │
│    • Stratification preserved               │
│    • Checkpoint β match (Δβ/SE < 3)  ⭐     │
└─────────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────┐
│ VALIDATION  (anywhere)                      │
│                                             │
│  datamirror check using <path>              │
│     ↓                                       │
│  Fidelity report                            │
└─────────────────────────────────────────────┘

The metadata files (schema, marginals, correlations, checkpoints) are small plain-text CSVs that respect statistical disclosure control — small cells are suppressed automatically. They can be reviewed by a data-owner before export and moved out of the secure environment through whatever channel the agency allows.

Command — datamirror init

Initialize a datamirror session and specify the output directory.

datamirror init, checkpoint_dir(string) [strata(varname) replace min_cell_size(#)]

Options

  • checkpoint_dir(string) — directory where metadata files will be written. Required.
  • strata(varname) — stratification variable (optional). When specified, all properties are preserved within strata — useful for panel data and treatment/control groups.
  • replace — overwrite any existing contents of checkpoint_dir.

datamirror init, checkpoint_dir("output") strata(wave) replace

Command — datamirror checkpoint

Save the current model's results as a replication target.

datamirror checkpoint, tag(string) [notes(string)]

Requirements

  • Must be called immediately after an estimation command (regress, reghdfe, logit, …).
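  • Results are read from Stata's stored e() results, so no other command that overwrites or clears e() may run between the estimation command and the checkpoint.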

Options

  • tag(string) — unique identifier for this checkpoint (required).
  • notes(string) — optional human-readable description.

Supported estimation commands

  • regress — OLS
  • reghdfe — fixed effects
  • ivregress 2sls — 2SLS (single-checkpoint and joint across shared-outcome groups)
  • logit, logistic, probit — binary outcomes
  • poisson, nbreg — count outcomes

Unsupported commands (xtreg, ologit, mlogit, stcox, tobit, ivreghdfe, ...) exit datamirror checkpoint cleanly with rc=199. A subset is on the v1.1 roadmap.

reg employed age female if !missing(employed, age, female)
datamirror checkpoint, tag("model1") notes("Employment regression")

reghdfe wellbeing age employed, absorb(id wave)
datamirror checkpoint, tag("fe_model") notes("FE with slopes")
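
A do-file that mixes supported and unsupported estimators can trap the documented rc=199 with Stata's capture prefix and carry on. The sketch below assumes registream and datamirror are installed; the tag name and the skip message are illustrative, and tobit is used only because it appears in the unsupported list above.

```stata
* Sketch only: requires registream and datamirror to be installed.
sysuse auto, clear
datamirror init, checkpoint_dir("output") replace

* tobit is listed as unsupported, so the checkpoint is expected to exit with rc=199.
tobit mpg weight, ll(17)
capture noisily datamirror checkpoint, tag("tobit_model")
if _rc == 199 {
    * Illustrative message; the do-file continues with the next model.
    display as text "tobit_model skipped: estimator not supported by datamirror checkpoint"
}
else if _rc != 0 {
    exit _rc    // any other return code is unexpected, so stop the do-file
}
```

Using capture noisily keeps the package's own message visible while still letting the script branch on _rc.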

Command — datamirror extract

Extract all metadata from the currently-loaded dataset.

datamirror extract [, replace]

Output files (written to checkpoint_dir)

  • metadata.csv — session info
  • schema.csv — variable types, labels, and integer-detection flags
  • marginals_cont.csv — continuous-variable quantile distributions
  • marginals_cat.csv — categorical-variable frequency tables
  • correlations.csv — correlation matrix
  • checkpoints.csv — checkpoint registry (one row per tagged regression)
  • checkpoints_coef.csv — target coefficients (one row per coefficient; cp_num foreign key)
  • manifest.csv — self-describing file listing with row counts

Command — datamirror rebuild

Generate synthetic data from the extracted metadata.

datamirror rebuild using path [, clear seed(#) n(#) scale(#) verify]

Options

  • clear — replace the currently-loaded dataset.
  • seed(#) — random seed for reproducibility.
  • n(#) — redraw at # observations instead of the bundle's recorded N; relationships and pinned coefficients are preserved at the requested size.
  • scale(#) — redraw at a multiple of the recorded N rather than an absolute count (scale(0.001) draws one-thousandth, scale(2) draws double). Mutually exclusive with n(#); both resolve to the same target size.
  • verify — after rebuild, re-run every checkpoint and report Δβ/SE.

What happens

  1. Load the on-disk metadata from path (schema, marginals, correlations, checkpoints).
  2. Draw a Gaussian copula sample at the seeded random state.
  3. Transform each dimension to its Layer 1 marginal; apply within strata if Layer 3 was used.
  4. Layer 4: pin the checkpointed coefficients. Linear estimators use a closed-form Newton step; shared-outcome IV groups use a joint stacked Newton step; GLMs redraw y from the canonical DGP at the target coefficients.
  5. Return a synthetic dataset whose regressions reproduce the checkpointed estimates within Δβ/SE < 3.

datamirror rebuild using "output", clear seed(12345)
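
Steps 2 and 3 can be illustrated with built-in Stata commands. This is a minimal sketch of the Gaussian copula idea, not datamirror's implementation: the correlation matrix, the sample size, and the two marginals (an exponential "income" and a uniform "age") are made-up stand-ins, and closed-form inverse CDFs replace the stored quantile tables.

```stata
clear
set seed 12345

* Step 2: draw correlated standard normals (the Gaussian copula sample).
matrix C = (1, 0.6 \ 0.6, 1)
drawnorm z1 z2, n(1000) corr(C)

* Map each normal score to a uniform value in (0, 1).
generate double u1 = normal(z1)
generate double u2 = normal(z2)

* Step 3: push the uniforms through each marginal's inverse CDF.
* Stand-in marginals: exponential income, uniform age on [18, 65].
generate double income = -20000 * ln(1 - u1)
generate double age    = 18 + 47 * u2

* Both transforms are strictly increasing, so ranks are unchanged
* and the two Spearman correlations below agree.
spearman z1 z2
spearman income age
```

Because only ranks pass through the copula, the marginals can be swapped freely without disturbing the rank dependence between variables.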

Command — datamirror check

Validate synthetic-data fidelity against the original checkpoints.

datamirror check using path [, saving(filename)]

saving(filename) exports the per-coefficient fidelity table to CSV (used by the replication verify.do scripts).

Validations

  • Marginal distributions — Kolmogorov–Smirnov tests per variable
  • Correlations — Spearman rank correlations on the copula sample
  • Checkpoint coefficients — target Δβ/SE < 3 (a fixed diagnostic tolerance; 3 is the 99.7% single-coefficient band)

Output

──────────────────────────────────────────────────────────
Checkpoint 1: employment_model
──────────────────────────────────────────────────────────
  Variable        Original    Synthetic      Δβ      Δβ/SE
  ───────────────────────────────────────────────────────
  age             -0.0064     -0.0064      0.0000    0.00
  female          -0.1128     -0.1128      0.0000    0.00
  _cons            0.8973      0.8973      0.0000    0.00
  ───────────────────────────────────────────────────────
  Max Δβ/SE = 0.00
  ✓ PASS (Δβ/SE < 3)
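
The Δβ/SE column can be reproduced by hand, which helps when reading the report. The sketch below makes two assumptions that the text above does not confirm: that SE means the original regression's standard error, and that a bootstrap resample is a fair stand-in for a synthetic dataset. The matrix and macro names are illustrative.

```stata
sysuse auto, clear

* "Original" fit: keep the coefficient vector and its variance matrix.
regress price mpg weight
matrix b_orig = e(b)
matrix V_orig = e(V)

* Stand-in for a synthetic dataset: a bootstrap resample of the same rows.
set seed 12345
bsample
regress price mpg weight
matrix b_syn = e(b)

* Per-coefficient |Δβ| / SE, using the original standard errors.
local names : colnames b_orig
local k = colsof(b_orig)
local max_ratio = 0
forvalues j = 1/`k' {
    local name : word `j' of `names'
    local ratio = abs(b_syn[1,`j'] - b_orig[1,`j']) / sqrt(V_orig[`j',`j'])
    display %-10s "`name'" %9.3f `ratio'
    if `ratio' > `max_ratio' local max_ratio = `ratio'
}
display "Max Δβ/SE = " %6.3f `max_ratio'
```

A checkpoint passes the documented tolerance when the final number is below 3.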

The four-layer architecture

datamirror preserves fidelity through four constraint layers plus schema metadata:

Schema (metadata)

  • Variable types (numeric, categorical, string)
  • Value labels
  • Variable labels
  • Storage formats

Layer 1 — Marginal distributions

  • Continuous: 101-point quantile distributions (p0, p1, …, p100). Variables containing only integer values in the original data are auto-detected and rounded to integers in the synthetic data.
  • Categorical: complete frequency tables for all values.
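
For intuition, a 101-point quantile summary and the integer check can be approximated with built-in commands. This is a sketch under assumptions, not the package's code: the percentile definition datamirror uses may differ from _pctile's default, and the macro names are illustrative.

```stata
sysuse auto, clear

* Interior percentiles p1..p99 from _pctile.
_pctile price, nquantiles(100)
forvalues q = 1/99 {
    local p`q' = r(r`q')
}

* p0 and p100 are the minimum and maximum.
quietly summarize price
local p0   = r(min)
local p100 = r(max)
display "p0 = `p0', p50 = `p50', p100 = `p100'"

* Integer detection: true when every nonmissing value is a whole number.
capture assert price == floor(price) if !missing(price)
local is_integer = (_rc == 0)
display "is_integer = `is_integer'"
```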

Layer 2 — Correlation structure

  • Gaussian copula captures pairwise dependencies.
  • Fidelity target: synthetic correlations match the originals with r > 0.95.

Layer 3 — Stratification

  • All properties hold within strata (e.g. by year, region, wave, treatment group).
  • Enables per-wave (cross-sectional) synthesis with independent distributions per stratum. Within-unit time-series dependence is not modeled (see Limitations).

Layer 4 — Checkpoint constraints ⭐

Two principled methods, one per family, with no learning rates or iteration knobs.

  • Linear estimators (OLS, FE, 2SLS). β̂ is a linear functional of y, so the exact shift that moves β̂ to the target β* is computable in closed form: delta_y = X · (β* − β̂) via Stata's matrix score. One step, no iteration.
  • Shared-outcome 2SLS groups. When multiple IV specifications share an outcome (e.g. main-shock and gender-shock specs on the same dependent variable), a joint stacked Newton step pins every coefficient constraint simultaneously.
  • Generalized linear models (logit, probit, poisson, nbreg). y is resampled from the canonical data-generating process at the target coefficients. Factor-level coefficients are preserved by construction.
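
The closed-form shift for the linear family can be checked on a toy example. Since β̂ = (X'X)⁻¹X'y, replacing y with y + X(β* − β̂) moves the OLS estimate to exactly β*, and because the shift lies in the column space of X the residuals and standard errors do not change. The sketch below uses the auto dataset with made-up target values; it illustrates the algebra and is not datamirror's own code.

```stata
sysuse auto, clear
regress price mpg weight
matrix b_hat = e(b)
scalar se_mpg_before = _se[mpg]

* Hypothetical targets, chosen for illustration only.
matrix b_star = b_hat
matrix b_star[1,1] = -100     // mpg
matrix b_star[1,2] = 2.5      // weight

* delta_y = X * (b_star - b_hat), evaluated row by row with matrix score.
matrix d = b_star - b_hat
matrix colnames d = mpg weight _cons
matrix score double delta_y = d
generate double price_pinned = price + delta_y

* Refit on the shifted outcome: the coefficients land on the targets.
regress price_pinned mpg weight
display "mpg: " _b[mpg] "   weight: " _b[weight]
```

The intercept target was left equal to its estimate here, so the shift only re-weights mpg and weight.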

Fidelity target: Δβ/SE < 3 (a fixed diagnostic tolerance; 3 is the 99.7% single-coefficient band). Validated on four AEA replication packages; the checkpointed regressions clear the threshold.

Rare-binary caveat. When a binary outcome has prevalence below ~0.10, the Gaussian copula does not preserve its correlations with other variables well. datamirror extract emits a warning in this regime; the regression will still run, but Δβ/SE may be elevated for coefficients involving the rare binary. A v1.1 copula refinement is planned.

Supported estimation families (Layer 4)

The fidelity target for every command is Δβ/SE < 3. Unsupported commands exit datamirror checkpoint cleanly with a pointer to this page.

  Family    Commands                           Method
  ────────────────────────────────────────────────────────────────────────────────────────────────
  Linear    regress, reghdfe                   Closed-form Newton via matrix score
  2SLS      ivregress 2sls (single + joint)    Weighted-FWL Newton; joint stack for shared outcomes
  Binary    logit, probit                      Direct Bernoulli DGP at target coefficients
  Count     poisson, nbreg                     Direct Poisson / Gamma-Poisson DGP at target coefficients
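
The Binary row can be sketched with built-in commands: compute the linear predictor at the target coefficients, then draw the outcome from a Bernoulli distribution with the implied probability. The target values and variable names below are hypothetical, and this is an illustration of the idea rather than the package's implementation.

```stata
sysuse auto, clear
set seed 12345

* Hypothetical target coefficients for a model of the form: logit foreign mpg
matrix b_star = (0.16, -4.4)
matrix colnames b_star = mpg _cons

* Linear predictor at the targets, then a Bernoulli draw with p = invlogit(xb).
matrix score double xb = b_star
generate byte foreign_syn = runiform() < invlogit(xb)

* Refit: the estimates differ from b_star by sampling noise only.
logit foreign_syn mpg
```

Unlike the linear shift, a redraw matches the targets only up to sampling error, which is why fidelity is stated as a tolerance in standard-error units. A count outcome would follow the same pattern with rpoisson(exp(xb)) in place of the Bernoulli draw.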

Privacy & statistical disclosure control

datamirror implements Statistical Disclosure Control (SDC) so that checkpoint metadata cannot be used to identify small groups or individuals. Three controls govern the output, each resolved the same way (init option > registream config > package default):

  • min_cell_size (default 50) — suppress categorical cells and strata with fewer observations than the threshold.
  • quantile_trim (default 1) — top/bottom-code continuous extremes at the given percentile in each tail, so the synthetic support is [p1, p99], not the raw min/max.
  • min_resid_df (default 10) — minimum residual degrees of freedom for a regression checkpoint.

The default cell size of 50 is the strictest threshold of any major national statistical agency.

  Threshold    Use case                                    Agencies
  ──────────────────────────────────────────────────────────────────────────────────────────────────
  50           Maximum safety (multi-agency compliance)    All (strictest)
  20           Standard microdata release                  Statistics Sweden, Eurostat
  10           Internal use within secure environment      US Census Bureau, UK ONS, Eurostat (tables)
  5            Not recommended                             Below most microdata standards
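
The effect of min_cell_size and quantile_trim can be previewed on any dataset with built-in commands. The sketch below applies the documented defaults to Stata's nlsw88 example data; the flagging logic and variable names are assumptions made for illustration, and datamirror's internal handling may differ.

```stata
sysuse nlsw88, clear

* min_cell_size = 50: flag categorical cells with fewer than 50 observations.
local min_cell_size 50
bysort race: generate long cell_n = _N
generate byte suppressed = cell_n < `min_cell_size'
tabulate race suppressed

* quantile_trim = 1: clamp a continuous variable to its [p1, p99] range.
_pctile wage, percentiles(1 99)
local lo = r(r1)
local hi = r(r2)
generate double wage_trim = min(max(wage, `lo'), `hi')
summarize wage wage_trim
```

Running this kind of preview before extraction shows which categories would fall below the threshold and how far the trimmed range sits from the raw extremes.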

Set a control for one extract via the init option, or persist it across sessions via the registream config:

datamirror init, checkpoint_dir("output") replace min_cell_size(20)
registream config, dm_min_cell_size(20)   // persistent

Regression checkpoints are gated at capture time: a checkpoint is rejected if its sample N is below min_cell_size or its residual df is below min_resid_df, and any coefficient describing a sub-min_cell_size group is dropped while the rest of the regression is kept. The resolved thresholds and the suppression counts (including n_coef_suppressed) are recorded in metadata.csv. Full detail on the Privacy & SDC page.
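
The sample-size part of this gate can be mimicked after any regress call by reading e(N) and e(df_r). This is a sketch of the rule as described above, with the documented defaults hard-coded; it is not the package's own check, and e(df_r) is not stored by every estimator.

```stata
sysuse auto, clear
local min_cell_size 50
local min_resid_df  10

regress price mpg weight

* Reject when the estimation sample or the residual df is too small.
if e(N) < `min_cell_size' | e(df_r) < `min_resid_df' {
    display as error "checkpoint would be rejected: N = " e(N) ", residual df = " e(df_r)
}
else {
    display as text "checkpoint would pass: N = " e(N) ", residual df = " e(df_r)
}
```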

Limitations

Scope: cross-sectional microdata

datamirror v1 targets cross-sectional economic microdata, which represents the majority of Stata-based empirical workflows.

  Design                                  Support            Notes
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Cross-sectional RCT                     ✓ Full             Primary use case
  Balance checks                          ✓ Full             Supported via standard OLS / FE checkpoints
  OLS / FE regression                     ✓ Full             Supported
  Treatment effects                       ✓ Full             With stratification for interactions
  IV (distinct instruments)               ✓ Full             Supported
  IV (correlated instruments)             ∼ Partial          Known limitation; methodological work in progress
  Panel data (within-unit time-series)    ✗ Not supported    Requires fundamentally different dependence structure
  Event studies / DiD                     ✗ Not supported    Requires pre/post correlation within units
Panel data support requires a different generative architecture (temporal copulas, hierarchical models, state-space approaches) and is a separate methodological direction for future work.

Marginal drift on the outcome

Layer 4 either shifts the outcome y (linear family) or resamples it (GLM family) to pin the target coefficients. Predictor marginals are preserved intact; the outcome marginal drifts by O(1/√N) as a consequence.

  • Predictors (age, education, income, ...) are preserved.
  • Outcomes may drift slightly in mean or SD as a result of the coefficient-matching step.

Rationale: predictors are used across many analyses; outcomes usually belong to one model each, and pinning coefficients is the primary claim. If you report a summary-statistics table on the outcome variable, add a caveat noting that the mean and SD reflect the synthetic generation rather than the original.

Output files

After datamirror extract, the checkpoint_dir contains:

  File                    Contents
  ──────────────────────────────────────────────────────────────────────────────────────────────
  metadata.csv            Session info, version, timestamp
  schema.csv              varname, type, format, storage, is_integer
  marginals_cont.csv      101-point quantile distributions for continuous variables
  marginals_cat.csv       Frequency tables for categorical variables
  correlations.csv        Pairwise correlation matrix
  checkpoints.csv         Checkpoint registry (tag, notes, cmd, model, …)
  checkpoints_coef.csv    Target coefficients, long format with cp_num foreign key
  manifest.csv            Self-describing listing of every file with row counts and descriptions

All files are plain CSV. They can be reviewed by a data-owner before export and moved out of a secure environment through any standard file-transfer channel.

Author

Jeffrey Clark

PhD Student, Economics

Stockholm University