datamirror · Stata

Stata reference

Complete command reference for datamirror in Stata — workflow, commands, the four-layer architecture, supported estimation families, and privacy controls.

Install

Requires Stata 16.0 or later.

net install registream, from("https://registream.org/install/stata/latest") replace

net install registream installs core plus every module (autolabel and datamirror). If you only want datamirror, install just that module: net install datamirror pulls in core as a dependency automatically. For full install notes and the first-run wizard, see the install guide.

Quick start

* 1. Load your sensitive data
use "my_confidential_data.dta", clear

* 2. Initialize a datamirror session
datamirror init, checkpoint_dir("output") replace

* 3. Run your key analyses and checkpoint the results
reg employed age female
datamirror checkpoint, tag("employment_model")

reg wellbeing age i.education female
datamirror checkpoint, tag("wellbeing_model")

* 4. Extract metadata (distributions + checkpoint targets)
datamirror extract, replace

* 5. Generate synthetic data from the metadata
datamirror rebuild using "output", clear seed(12345)

* 6. Validate fidelity
datamirror check using "output"
*   ✓ employment_model: Max Δβ/SE = 0.02
*   ✓ wellbeing_model:  Max Δβ/SE = 0.05

The synthetic dataset reproduces the original coefficients to within Δβ/SE < 3, the target the package equates with the Bonferroni-corrected 99% simultaneous CI. You can re-run the same analysis script outside the secure environment and obtain estimates that match the originals within that tolerance.

Workflow

The standard datamirror flow has two phases running in different environments:

┌─────────────────────────────────────────────┐
│ EXTRACT PHASE  (inside the secure env)      │
│                                             │
│  Original Data                              │
│     ↓                                       │
│  datamirror init + checkpoint + extract     │
│     ↓                                       │
│  Metadata files:                            │
│    • metadata.csv                           │
│    • schema.csv                             │
│    • marginals_cont.csv                     │
│    • marginals_cat.csv                      │
│    • correlations.csv                       │
│    • checkpoints.csv                        │
│    • checkpoints_coef.csv                   │
│    • manifest.csv                           │
└─────────────────────────────────────────────┘
                     ↓  transfer (small files)
┌─────────────────────────────────────────────┐
│ REBUILD PHASE  (outside the secure env)     │
│                                             │
│  datamirror rebuild using <path>            │
│     ↓                                       │
│  Synthetic dataset:                         │
│    • Marginals match (KS p > 0.05)          │
│    • Correlations match (r > 0.95)          │
│    • Stratification preserved               │
│    • Checkpoint β match (Δβ/SE < 3)  ⭐     │
└─────────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────┐
│ VALIDATION  (anywhere)                      │
│                                             │
│  datamirror check using <path>              │
│     ↓                                       │
│  Fidelity report                            │
└─────────────────────────────────────────────┘

The metadata files (schema, marginals, correlations, checkpoints) are small plain-text CSVs that respect statistical disclosure control: small cells are suppressed automatically. They can be reviewed by a data owner before export and moved out of the secure environment through whatever channel the agency allows.

Command — datamirror init

Initialize a datamirror session and specify the output directory.

datamirror init, checkpoint_dir(string) [strata(varname) replace]

Options

  • checkpoint_dir(string) — directory where metadata files will be written. Required.
  • strata(varname) — stratification variable (optional). When specified, all properties are preserved within strata — useful for panel data and treatment/control groups.
  • replace — overwrite any existing contents of checkpoint_dir.

datamirror init, checkpoint_dir("output") strata(wave) replace

Command — datamirror checkpoint

Save the current model's results as a replication target.

datamirror checkpoint, tag(string) [notes(string)]

Requirements

  • Must be called immediately after an estimation command (regress, reghdfe, logit, …).
  • Results are read from Stata's e() macros.
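
Because the checkpoint reads whatever is currently stored in e(), any estimation command run in between replaces the results it would capture. The short illustration below uses Stata's bundled auto dataset and only built-in commands; it is not datamirror code, and which e() entries datamirror actually reads is not specified on this page.

```stata
* Illustration only: e() always describes the most recent estimation.
sysuse auto, clear

regress price mpg weight
display "`e(cmd)' on `e(depvar)', N = " e(N)    // first model

logit foreign mpg
display "`e(cmd)' on `e(depvar)', N = " e(N)    // e() now holds the logit

* A checkpoint issued at this point would see the logit, not the regression.
matrix list e(b)
```

In practice this means placing each datamirror checkpoint call on the line directly after the model it is meant to record.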

Options

  • tag(string) — unique identifier for this checkpoint (required).
  • notes(string) — optional human-readable description.

Supported estimation commands

  • regress — OLS
  • reghdfe — fixed effects
  • ivregress 2sls — 2SLS (single-checkpoint and joint across shared-outcome groups)
  • logit, probit — binary outcomes
  • poisson, nbreg — count outcomes

Unsupported commands (xtreg, ologit, mlogit, stcox, tobit, ivreghdfe, ...) exit datamirror checkpoint cleanly with rc=199. A subset is on the v1.1 roadmap.
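
In a batch script it can be convenient to let an unsupported estimator pass without aborting the whole run. The sketch below assumes the rc=199 behaviour described above and uses tobit on Stata's bundled auto dataset as the unsupported example; the tag name is made up for illustration.

```stata
sysuse auto, clear
datamirror init, checkpoint_dir("output") replace

tobit mpg weight, ll(17)                 // tobit is listed as unsupported

* capture traps the error so the do-file can decide what to do with it.
capture noisily datamirror checkpoint, tag("tobit_model")
if _rc == 199 {
    display as text "tobit_model: estimator not supported, checkpoint skipped"
}
else if _rc != 0 {
    exit _rc                             // any other error is still fatal
}
```

Treating only return code 199 as skippable keeps genuine failures, such as a missing tag, visible.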

reg employed age female if !missing(employed, age, female)
datamirror checkpoint, tag("model1") notes("Employment regression")

reghdfe wellbeing age employed, absorb(id wave)
datamirror checkpoint, tag("fe_model") notes("FE with slopes")

Command — datamirror extract

Extract all metadata from the currently-loaded dataset.

datamirror extract [, replace]

Output files (written to checkpoint_dir)

  • metadata.csv — session info
  • schema.csv — variable types, labels, and integer-detection flags
  • marginals_cont.csv — continuous-variable quantile distributions
  • marginals_cat.csv — categorical-variable frequency tables
  • correlations.csv — correlation matrix
  • checkpoints.csv — checkpoint registry (one row per tagged regression)
  • checkpoints_coef.csv — target coefficients (one row per coefficient; cp_num foreign key)
  • manifest.csv — self-describing file listing with row counts

Command — datamirror rebuild

Generate synthetic data from the extracted metadata.

datamirror rebuild using path [, clear seed(#)]

Options

  • clear — replace the currently-loaded dataset.
  • seed(#) — random seed for reproducibility.

What happens

  1. Load the on-disk metadata from path (schema, marginals, correlations, checkpoints).
  2. Draw a Gaussian copula sample at the seeded random state.
  3. Transform each dimension to its Layer 1 marginal; apply within strata if Layer 3 was used.
  4. Layer 4: pin the checkpointed coefficients. Linear estimators use a closed-form Newton step; shared-outcome IV groups use a joint stacked Newton step; GLMs redraw y from the canonical DGP at the target coefficients.
  5. Return a synthetic dataset whose regressions reproduce the checkpointed estimates within Δβ/SE < 3.

datamirror rebuild using "output", clear seed(12345)
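
Steps 2 and 3 can be sketched with built-in Stata commands. This is a two-variable illustration of the Gaussian copula idea, not datamirror's implementation: the latent correlation, the variable names, and the target marginals are all assumptions, and closed-form inverse CDFs stand in for the 101-point quantile tables.

```stata
* Sketch of a Gaussian copula draw followed by marginal transforms.
clear
set seed 12345
set obs 5000

matrix C = (1, 0.6 \ 0.6, 1)            // assumed latent correlation
drawnorm z1 z2, corr(C)                 // correlated standard normals

generate double u1 = normal(z1)         // probability integral transform:
generate double u2 = normal(z2)         // each u is Uniform(0,1), still dependent

generate double income = -ln(1 - u1)    // Exponential(1) marginal via inverse CDF
generate byte employed = u2 > 0.30      // Bernoulli(0.70) marginal via threshold

summarize income employed
correlate income employed
```

The dependence is created once on the normal scale, so each marginal can be reshaped independently without re-estimating the joint structure.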

Command — datamirror check

Validate synthetic-data fidelity against the original checkpoints.

datamirror check using path

Validations

  • Marginal distributions — Kolmogorov–Smirnov tests per variable
  • Correlations — Pearson correlations on the copula sample
  • Checkpoint coefficients — target Δβ/SE < 3 (Bonferroni-corrected 99% simultaneous CI over the coefficients in each checkpoint)
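
For reference, the two-sided Bonferroni critical value at a 99% simultaneous level over k coefficients is invnormal(1 - 0.01/(2k)) under a normal approximation. The loop below tabulates it for a few values of k so the fixed threshold of 3 can be compared against it; this is a textbook calculation, not datamirror code, and the choice of k values is arbitrary.

```stata
* Bonferroni-corrected two-sided critical values at a 99% simultaneous level.
foreach k in 1 2 3 5 10 20 {
    local z = invnormal(1 - 0.01/(2*`k'))
    display "k = " %2.0f `k' "   critical |z| = " %5.3f `z'
}
```

The value is the familiar 2.576 for a single coefficient and grows slowly with k.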

Output

──────────────────────────────────────────────────────────
Checkpoint 1: employment_model
──────────────────────────────────────────────────────────
  Variable        Original    Synthetic      Δβ      Δβ/SE
  ───────────────────────────────────────────────────────
  age             -0.0064     -0.0064      0.0000    0.00
  female          -0.1128     -0.1128      0.0000    0.00
  _cons            0.8973      0.8973      0.0000    0.00
  ───────────────────────────────────────────────────────
  Max Δβ/SE = 0.00
  ✓ PASS (Δβ/SE < 3)
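
The same diagnostic can be recomputed by hand for a single regression. The sketch below assumes that Δβ/SE is the absolute coefficient difference divided by the original standard error, which this page does not spell out, and uses a bootstrap resample of Stata's bundled auto dataset as a stand-in for a synthetic dataset.

```stata
sysuse auto, clear
regress price mpg weight
matrix b0 = e(b)                         // "original" coefficients
matrix V0 = e(V)                         // "original" variance matrix

set seed 12345
bsample                                  // stand-in for a synthetic dataset
regress price mpg weight
matrix b1 = e(b)

* Largest standardized coefficient difference across all terms.
local maxz = 0
forvalues j = 1/`=colsof(b0)' {
    local z = abs(b1[1,`j'] - b0[1,`j']) / sqrt(V0[`j',`j'])
    local maxz = max(`maxz', `z')
}
display "Max |Δβ/SE| = " %5.2f `maxz'
```

A bootstrap resample is only a placeholder here; with a dataset produced by datamirror rebuild the differences are expected to be far smaller, as in the report above.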

The four-layer architecture

datamirror preserves fidelity through four constraint layers plus schema metadata:

Schema (metadata)

  • Variable types (numeric, categorical, string)
  • Value labels
  • Variable labels
  • Storage formats

Layer 1 — Marginal distributions

  • Continuous: 101-point quantile distributions (p0, p1, …, p100). Variables containing only integer values in the original data are auto-detected and rounded to integers in the synthetic data.
  • Categorical: complete frequency tables for all values.
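
A 101-point quantile summary and an integer check can be reproduced with built-in commands. This is a sketch of the idea on Stata's bundled auto dataset, not datamirror's extraction code; the is_integer label is borrowed from the schema.csv column of the same name.

```stata
sysuse auto, clear

* p1..p99 come from _pctile; p0 and p100 are the minimum and maximum.
_pctile price, nquantiles(100)
local p50 = r(r50)
quietly summarize price
display "p0 = " r(min) "   p50 = `p50'   p100 = " r(max)

* Integer detection: true when every non-missing value equals its floor.
capture assert price == floor(price) if !missing(price)
display "is_integer = " (_rc == 0)
```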

Layer 2 — Correlation structure

  • Gaussian copula captures pairwise dependencies.
  • Target: synthetic pairwise correlations agree with the originals at r > 0.95.

Layer 3 — Stratification

  • All properties hold within strata (e.g. by year, region, wave, treatment group).
  • Supports multi-wave data by synthesizing each stratum from its own distributions; within-unit dependence across waves is not modelled (see Limitations).

Layer 4 — Checkpoint constraints ⭐

Two principled methods, one for the linear family and one for GLMs, with no learning rates or iteration knobs. Shared-outcome 2SLS groups use a stacked variant of the linear method.

  • Linear estimators (OLS, FE, 2SLS). β̂ is a linear functional of y, so the exact shift that moves β̂ to the target β* is computable in closed form: delta_y = X · (β* − β̂) via Stata's matrix score. One step, no iteration.
  • Shared-outcome 2SLS groups. When multiple IV specifications share an outcome (e.g. main-shock and gender-shock specs on the same dependent variable), a joint stacked Newton step pins every coefficient constraint simultaneously.
  • Generalized linear models (logit, probit, poisson, nbreg). y is resampled from the canonical data-generating process at the target coefficients. Factor-level coefficients are preserved by construction.
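
The closed-form step for OLS follows from β̂ = (X'X)⁻¹X'y: replacing y with y + X(β* − β̂) adds exactly (β* − β̂) to the estimate. The sketch below demonstrates this on Stata's bundled auto dataset with matrix score; the target value and variable names are hypothetical, and datamirror's own handling of fixed effects and 2SLS is not shown.

```stata
sysuse auto, clear
regress price mpg weight
matrix bhat = e(b)

* Hypothetical target: move the mpg slope by +50, leave the rest unchanged.
matrix bstar = bhat
matrix bstar[1,1] = bhat[1,1] + 50

matrix d = bstar - bhat                   // β* − β̂
matrix colnames d = mpg weight _cons      // explicit names for matrix score
matrix score double delta_y = d           // X·(β* − β̂), constant included
generate double price_shifted = price + delta_y

regress price_shifted mpg weight          // coefficients now equal bstar
matrix list bstar
```

Residuals are unchanged by the shift, so the standard errors of the second regression match those of the first.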

Fidelity target: Δβ/SE < 3 (Bonferroni-corrected 99% simultaneous CI over the coefficients in each checkpoint). Validated on four AEA replication packages: 390+ coefficient comparisons, none exceed the threshold.

Rare-binary caveat. When a binary outcome has prevalence below ~0.10, the Gaussian copula does not preserve its correlations with other variables well. datamirror extract emits a warning in this regime; the regression will still run, but Δβ/SE may be elevated for coefficients involving the rare binary. A v1.1 copula refinement is planned.

Supported estimation families (Layer 4)

The fidelity target for every command is Δβ/SE < 3. Unsupported commands exit datamirror checkpoint cleanly with a pointer to this page.

Family    Commands                           Method
Linear    regress, reghdfe                   Closed-form Newton via matrix score
2SLS      ivregress 2sls (single + joint)    Weighted-FWL Newton; joint stack for shared outcomes
Binary    logit, probit                      Direct Bernoulli DGP at target coefficients
Count     poisson, nbreg                     Direct Poisson / Gamma-Poisson DGP at target coefficients
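
The "direct DGP" rows can be illustrated for the logit case: build the linear index at the target coefficients, then draw the outcome from the implied Bernoulli probabilities. This is a sketch on simulated data with hypothetical targets and variable names, not datamirror's implementation.

```stata
clear
set seed 12345
set obs 5000
generate double age = rnormal(40, 10)     // hypothetical predictor

matrix bstar = (0.05, -2)                 // hypothetical target coefficients
matrix colnames bstar = age _cons
matrix score double xb = bstar            // linear index X·β*

* Bernoulli draw with success probability invlogit(xb).
generate byte employed = runiform() < invlogit(xb)

logit employed age                        // estimates scatter around bstar
```

A count outcome would follow the same pattern with rpoisson(exp(xb)) in place of the Bernoulli draw.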

Privacy & statistical disclosure control

datamirror implements Statistical Disclosure Control (SDC) to ensure that checkpoint metadata cannot be used to identify small groups or individuals. All privacy controls are governed by a single global parameter, DM_MIN_CELL_SIZE.

Default: 50 — the strictest threshold of any major national statistical agency. Groups or strata with fewer than 50 observations are automatically suppressed from the checkpoint metadata.

Threshold    Use case                                    Agencies
50           Maximum safety: multi-agency compliance     All (strictest)
20           Standard microdata release                  Statistics Sweden, Eurostat
10           Internal use within secure environment      US Census Bureau, UK ONS, Eurostat (tables)
5            Not recommended                             Below most microdata standards

Change by setting the global before datamirror extract:

global DM_MIN_CELL_SIZE = 20
datamirror extract, replace
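
Before lowering the threshold it can help to see which cells would fall below it. The check below is a plain-Stata sketch on the bundled auto dataset and does not reproduce datamirror's suppression logic; the cell_n variable name is made up for illustration.

```stata
sysuse auto, clear
global DM_MIN_CELL_SIZE = 20

* Count observations per category, then list categories under the threshold.
bysort rep78: generate long cell_n = _N
tabulate rep78 if cell_n < $DM_MIN_CELL_SIZE, missing
```

The same pattern applies to a strata variable: any stratum listed here would be a candidate for suppression at the chosen threshold.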

Limitations

Scope: cross-sectional microdata

datamirror v1 targets cross-sectional economic microdata, which represents the majority of Stata-based empirical workflows.

Design                                  Support          Notes
Cross-sectional RCT                     ✓ Full           Primary use case
Balance checks                          ✓ Full           Supported via standard OLS / FE checkpoints
OLS / FE regression                     ✓ Full           Supported
Treatment effects                       ✓ Full           With stratification for interactions
IV (distinct instruments)               ✓ Full           Supported
IV (correlated instruments)             ∼ Partial        Known limitation; methodological work in progress
Panel data (within-unit time-series)    Not supported    Requires fundamentally different dependence structure
Event studies / DiD                     Not supported    Requires pre/post correlation within units

Panel data support requires a different generative architecture (temporal copulas, hierarchical models, state-space approaches) and is a separate methodological direction for future work.

Marginal drift on the outcome

Layer 4 either shifts the outcome y (linear family) or resamples it (GLM family) to pin the target coefficients. Predictor marginals are preserved intact; the outcome marginal drifts by O(1/√N) as a consequence.

  • Predictors (age, education, income, ...) are preserved.
  • Outcomes may drift slightly in mean or SD as a result of the coefficient-matching step.

Rationale: predictors are used across many analyses; outcomes usually belong to one model each, and pinning coefficients is the primary claim. If you report a summary-statistics table on the outcome variable, add a caveat noting that the mean and SD reflect the synthetic generation rather than the original.

Output files

After datamirror extract, the checkpoint_dir contains:

File                    Contents
metadata.csv            Session info, version, timestamp
schema.csv              varname, type, format, storage, is_integer
marginals_cont.csv      101-point quantile distributions for continuous variables
marginals_cat.csv       Frequency tables for categorical variables
correlations.csv        Pairwise correlation matrix
checkpoints.csv         Checkpoint registry (tag, notes, cmd, model, …)
checkpoints_coef.csv    Target coefficients, long format with cp_num foreign key
manifest.csv            Self-describing listing of every file with row counts and descriptions

All files are plain CSV. They can be reviewed by a data owner before export and moved out of a secure environment through any standard file-transfer channel.
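
Since the files are ordinary CSVs, a reviewer can open them with built-in commands. The snippet below assumes "output" is the checkpoint_dir used earlier on this page and makes no assumption about column names.

```stata
* Look at the manifest without disturbing the dataset in memory.
preserve
import delimited using "output/manifest.csv", clear varnames(1)
list, clean noobs
restore
```

The same three lines work for any other file in the directory by changing the file name.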

Author

Jeffrey Clark

PhD Student, Economics

Stockholm University