datamirror Stata reference

Install

Requires Stata 16.0 or later.

net install registream, from("https://registream.org/install/stata/latest") replace

net install registream installs the core plus every module (autolabel and datamirror). To install only datamirror, run net install datamirror instead; it pulls in core automatically as a dependency. For full install notes and the first-run wizard, see the install guide.

Quick start

* 1. Load your sensitive data
use "my_confidential_data.dta", clear

* 2. Initialize a datamirror session
datamirror init, checkpoint_dir("output") replace

* 3. Run your key analyses and checkpoint the results
reg employed age female
datamirror checkpoint, tag("employment_model")

reg wellbeing age i.education female
datamirror checkpoint, tag("wellbeing_model")

* 4. Extract metadata (distributions + checkpoint targets)
datamirror extract, replace

* 5. Generate synthetic data from the metadata
datamirror rebuild using "output", clear seed(12345)

* 6. Validate fidelity
datamirror check using "output"
*   ✓ employment_model: Max Δβ/SE = 0.02
*   ✓ wellbeing_model:  Max Δβ/SE = 0.05

The synthetic dataset reproduces the original coefficients within Δβ/SE < 3, which lies inside the Bonferroni-corrected 99% simultaneous CI. Re-running the same analysis script outside the secure environment yields estimates that match the originals within that tolerance.

Workflow

The standard datamirror flow has two phases that run in different environments, followed by a validation step that can run anywhere:

┌─────────────────────────────────────────────┐
│ EXTRACT PHASE  (inside the secure env)      │
│                                             │
│  Original Data                              │
│     ↓                                       │
│  datamirror init + checkpoint + extract     │
│     ↓                                       │
│  Metadata files:                            │
│    • schema.csv                             │
│    • marginals_cont.csv                     │
│    • marginals_cat.csv                      │
│    • correlations.csv                       │
│    • checkpoints.csv                        │
│    • checkpoints_coef.csv                   │
│    • manifest.csv                           │
└─────────────────────────────────────────────┘
                     ↓  transfer (small files)
┌─────────────────────────────────────────────┐
│ REBUILD PHASE  (outside the secure env)     │
│                                             │
│  datamirror rebuild using <path>            │
│     ↓                                       │
│  Synthetic dataset:                         │
│    • Marginals match (KS p > 0.05)          │
│    • Correlations match (r > 0.95)          │
│    • Stratification preserved               │
│    • Checkpoint β match (Δβ/SE < 3)  ⭐     │
└─────────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────┐
│ VALIDATION  (anywhere)                      │
│                                             │
│  datamirror check using <path>              │
│     ↓                                       │
│  Fidelity report                            │
└─────────────────────────────────────────────┘

The metadata files (schema, marginals, correlations, checkpoints) are small plain-text CSVs that respect statistical disclosure control: small cells are suppressed automatically. They can be reviewed by a data owner before export and moved out of the secure environment through whatever channel the agency allows.

Command — `datamirror init`

Initialize a datamirror session and specify the output directory.

datamirror init, checkpoint_dir(string) [strata(varname) replace]

Options

checkpoint_dir(string) — directory where metadata files will be written. Required.
strata(varname) — stratification variable (optional). When specified, all properties are preserved within strata — useful for panel data and treatment/control groups.
replace — overwrite any existing contents of checkpoint_dir.

datamirror init, checkpoint_dir("output") strata(wave) replace

Command — `datamirror checkpoint`

Save the current model's results as a replication target.

datamirror checkpoint, tag(string) [notes(string)]

Requirements

Must be called immediately after an estimation command (regress, reghdfe, logit, …).
Results are read from Stata's stored estimation results, e().
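For orientation, the stored results that an estimation command leaves behind can be inspected directly. The sketch below uses Stata's bundled auto dataset. Which e() entries datamirror checkpoint actually reads is not documented here, so the selection shown is an assumption.

```stata
* Illustration only: inspect the stored results left by an estimation command.
* Which of these datamirror checkpoint reads is an assumption, not documented.
sysuse auto, clear
regress price mpg weight

display "command:      " e(cmd)
display "outcome:      " e(depvar)
display "observations: " e(N)

matrix b = e(b)          // 1 x k row vector of coefficients
matrix V = e(V)          // k x k variance-covariance matrix
matrix list b
display "SE(mpg) = " sqrt(V[1,1])
```

Any later estimation command overwrites e(), which is why the checkpoint has to follow the model it is meant to record.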

Options

tag(string) — unique identifier for this checkpoint (required).
notes(string) — optional human-readable description.

Supported estimation commands

regress — OLS
reghdfe — fixed effects
ivregress 2sls — 2SLS (single-checkpoint and joint across shared-outcome groups)
logit, probit — binary outcomes
poisson, nbreg — count outcomes

If the preceding estimation command is unsupported (xtreg, ologit, mlogit, stcox, tobit, ivreghdfe, ...), datamirror checkpoint exits cleanly with rc=199. A subset of these is on the v1.1 roadmap.

reg employed age female if !missing(employed, age, female)
datamirror checkpoint, tag("model1") notes("Employment regression")

reghdfe wellbeing age employed, absorb(id wave)
datamirror checkpoint, tag("fe_model") notes("FE with slopes")

Command — `datamirror extract`

Extract all metadata from the currently-loaded dataset.

datamirror extract [, replace]

Output files (written to `checkpoint_dir`)

metadata.csv — session info
schema.csv — variable types, labels, and integer-detection flags
marginals_cont.csv — continuous-variable quantile distributions
marginals_cat.csv — categorical-variable frequency tables
correlations.csv — correlation matrix
checkpoints.csv — checkpoint registry (one row per tagged regression)
checkpoints_coef.csv — target coefficients (one row per coefficient; cp_num foreign key)
manifest.csv — self-describing file listing with row counts

Command — `datamirror rebuild`

Generate synthetic data from the extracted metadata.

datamirror rebuild using path [, clear seed(#)]

Options

clear — replace the currently-loaded dataset.
seed(#) — random seed for reproducibility.

What happens

Load the on-disk metadata from path (schema, marginals, correlations, checkpoints).
Draw a Gaussian copula sample at the seeded random state.
Transform each dimension to its Layer 1 marginal; apply within strata if Layer 3 was used.
Layer 4: pin the checkpointed coefficients. Linear estimators use a closed-form Newton step; shared-outcome IV groups use a joint stacked Newton step; GLMs redraw y from the canonical DGP at the target coefficients.
Return a synthetic dataset whose regressions reproduce the checkpointed estimates within Δβ/SE < 3.

datamirror rebuild using "output", clear seed(12345)

Command — `datamirror check`

Validate synthetic-data fidelity against the original checkpoints.

datamirror check using path

Validations

Marginal distributions — Kolmogorov–Smirnov tests per variable
Correlations — Pearson correlations on the copula sample
Checkpoint coefficients — target Δβ/SE < 3 (Bonferroni-corrected 99% simultaneous CI over the coefficients in each checkpoint)
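The marginal check can be approximated by hand with Stata's built-in two-sample Kolmogorov–Smirnov test. This is a minimal sketch, assuming a seeded bootstrap resample of the bundled nlsw88 data as a stand-in for a synthetic dataset. How datamirror check runs the test internally is not shown in this reference.

```stata
* Sketch: two-sample KS test between an "original" and a stand-in "synthetic"
* sample. The bootstrap resample is a placeholder, not datamirror output.
sysuse nlsw88, clear
keep wage
generate byte synthetic = 0
tempfile orig
save `orig'

set seed 12345
bsample                       // resample _N rows with replacement
replace synthetic = 1
append using `orig'           // stack both samples in one dataset

ksmirnov wage, by(synthetic)
scalar ks_p = r(p)            // combined two-sided p-value
display "KS p-value = " %6.4f ks_p
```

A large p-value means the test finds no evidence that the two marginals differ, which is the sense in which the workflow diagram reports KS p > 0.05.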

Output

──────────────────────────────────────────────────────────
Checkpoint 1: employment_model
──────────────────────────────────────────────────────────
  Variable        Original    Synthetic      Δβ      Δβ/SE
  ───────────────────────────────────────────────────────
  age             -0.0064     -0.0064      0.0000    0.00
  female          -0.1128     -0.1128      0.0000    0.00
  _cons            0.8973      0.8973      0.0000    0.00
  ───────────────────────────────────────────────────────
  Max Δβ/SE = 0.00
  ✓ PASS (Δβ/SE < 3)
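The Δβ/SE column can be reproduced by hand from two fits. The sketch below assumes the difference is scaled by the original model's standard error, which this reference does not state explicitly, and uses a bootstrap resample of the bundled auto data as a stand-in for synthetic data.

```stata
* Sketch: max |Δβ|/SE between an original fit and a fit on stand-in data.
* Scaling by the ORIGINAL standard error is an assumption.
sysuse auto, clear
regress price mpg weight
matrix b0 = e(b)
matrix V0 = e(V)

set seed 12345
bsample                                   // placeholder for a synthetic dataset
regress price mpg weight
matrix b1 = e(b)

scalar maxz = 0
forvalues j = 1/`=colsof(b0)' {
    scalar z = abs(b1[1,`j'] - b0[1,`j']) / sqrt(V0[`j',`j'])
    scalar maxz = max(scalar(maxz), scalar(z))
}
display "Max |Δβ|/SE = " %5.2f scalar(maxz)
```

A bootstrap resample will typically show larger deviations than a dataset whose coefficients have been pinned, so the number printed here only illustrates the calculation.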

The four-layer architecture

datamirror preserves fidelity through four constraint layers plus schema metadata:

Schema (metadata)

Variable types (numeric, categorical, string)
Value labels
Variable labels
Storage formats

Layer 1 — Marginal distributions

Continuous: 101-point quantile distributions (p0, p1, …, p100). Variables containing only integer values in the original data are auto-detected and rounded to integers in the synthetic data.
Categorical: complete frequency tables for all values.
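A 101-point quantile table and the integer check can both be approximated with built-in commands. The following is a rough sketch on the bundled nlsw88 data; the variable names q_wage and pct are hypothetical, and datamirror's own quantile and integer-detection rules may differ.

```stata
* Sketch: p1..p99 via pctile, with p0 and p100 taken from the min and max.
sysuse nlsw88, clear
pctile q_wage = wage, nquantiles(100) genp(pct)   // 99 cut points in rows 1-99
quietly summarize wage
display "p0 = " r(min) "   p50 = " q_wage[50] "   p100 = " r(max)

* Integer detection: a variable qualifies if every nonmissing value is whole.
foreach v of varlist wage hours {
    capture assert `v' == floor(`v') if !missing(`v')
    display "`v' integer-valued: " cond(_rc == 0, "yes", "no")
}
```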

Layer 2 — Correlation structure

Gaussian copula captures pairwise dependencies.
Preserves correlations with r > 0.95.
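The copula idea can be sketched in a few lines: draw correlated normals, convert them to uniforms, then push each uniform through the inverse CDF of its target marginal. In this illustration the correlation matrix and the two analytic marginals are made up; datamirror would instead use the extracted correlation matrix and the stored quantile tables.

```stata
* Sketch of a Gaussian copula with hypothetical inputs.
clear
set seed 12345
matrix C = (1, 0.6 \ 0.6, 1)          // hypothetical latent correlation
drawnorm z1 z2, n(5000) corr(C)       // correlated standard normals

generate double u1 = normal(z1)       // probability-integral transform
generate double u2 = normal(z2)

generate double x1 = -ln(1 - u1)      // exponential(1) marginal
generate byte   x2 = u2 > 0.7         // Bernoulli(0.3) marginal

correlate z1 z2
correlate x1 x2                       // dependence survives, but r shrinks
```

The second correlation is weaker than the latent 0.6 because the marginal transforms are nonlinear, which is one reason a binary with low prevalence is hard for a Gaussian copula to reproduce.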

Layer 3 — Stratification

All properties hold within strata (e.g. by year, region, wave, treatment group).
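Enables stratum-by-stratum (for example wave-by-wave) synthesis of panel / longitudinal data with independent distributions per stratum. Within-unit dependence across strata is not modeled (see Limitations).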
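As a small illustration of what "within strata" means for the marginals, the sketch below computes a few quantiles separately for each level of a stratification variable in the bundled nlsw88 data. The choice of race as the stratum and the three quantiles shown are arbitrary; datamirror stores a full table per stratum.

```stata
* Sketch: one row of summary quantiles per stratum.
sysuse nlsw88, clear
collapse (count) n = wage (p25) p25 = wage (p50) p50 = wage ///
         (p75) p75 = wage, by(race)
list, noobs
```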

Layer 4 — Checkpoint constraints ⭐

Two principled methods, one for the linear family and one for the GLM family, with no learning rates or iteration knobs.

Linear estimators (OLS, FE, 2SLS). β̂ is a linear functional of y, so the exact shift that moves β̂ to the target β* is computable in closed form: delta_y = X · (β* − β̂) via Stata's matrix score. One step, no iteration.
Shared-outcome 2SLS groups. When multiple IV specifications share an outcome (e.g. main-shock and gender-shock specs on the same dependent variable), a joint stacked Newton step pins every coefficient constraint simultaneously.
Generalized linear models (logit, probit, poisson, nbreg). y is resampled from the canonical data-generating process at the target coefficients. Factor-level coefficients are preserved by construction.
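The closed-form shift for the linear case is easy to verify for plain OLS. Because β̂ = (X'X)⁻¹X'y, adding X·d to y adds exactly d to β̂. The sketch below applies this on the bundled auto data with a made-up target for the mpg coefficient; it covers only regress and is not datamirror's internal code.

```stata
* Sketch: move an OLS coefficient to a target in one step via matrix score.
sysuse auto, clear
regress price mpg weight
matrix b_hat  = e(b)                  // columns: mpg, weight, _cons
matrix b_star = b_hat
matrix b_star[1,1] = -100             // hypothetical target for mpg

matrix d = b_star - b_hat             // required change in coefficients
matrix score double delta_y = d       // X * (b_star - b_hat), row by row
generate double price_shifted = price + delta_y

regress price_shifted mpg weight      // mpg is now -100; others unchanged
```

The predictors are untouched; only the outcome moves, which is the source of the outcome drift described under Limitations.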

Fidelity target: Δβ/SE < 3 (Bonferroni-corrected 99% simultaneous CI over the coefficients in each checkpoint). Validated on four AEA replication packages with 390+ coefficient comparisons, none of which exceeds the threshold.

Rare-binary caveat. When a binary outcome has prevalence below ~0.10, the Gaussian copula does not preserve its correlations with other variables well. datamirror extract emits a warning in this regime; the regression will still run, but Δβ/SE may be elevated for coefficients involving the rare binary. A v1.1 copula refinement is planned.

Supported estimation families (Layer 4)

The fidelity target for every command is Δβ/SE < 3. After an unsupported command, datamirror checkpoint exits cleanly with a pointer to this page.

Family	Commands	Method
Linear	`regress`, `reghdfe`	Closed-form Newton via `matrix score`
2SLS	`ivregress 2sls` (single + joint)	Weighted-FWL Newton; joint stack for shared outcomes
Binary	`logit`, `probit`	Direct Bernoulli DGP at target coefficients
Count	`poisson`, `nbreg`	Direct Poisson / Gamma-Poisson DGP at target coefficients
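For the binary family, redrawing the outcome from the model at the target coefficients can be sketched as follows. The predictor, sample size, and target values are all hypothetical, and the sketch covers logit only.

```stata
* Sketch: draw y from a logit DGP at hypothetical target coefficients,
* then confirm that refitting recovers them up to sampling noise.
clear
set seed 12345
set obs 20000
generate double x = rnormal()

* targets (hypothetical): intercept -0.5, slope 0.8
generate byte y = rbinomial(1, invlogit(-0.5 + 0.8*x))

logit y x
```

Unlike the linear shift, the refit matches the targets only up to sampling noise, so agreement is judged against the standard error.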

Privacy & statistical disclosure control

datamirror implements Statistical Disclosure Control (SDC) to ensure that checkpoint metadata cannot be used to identify small groups or individuals. All privacy controls are governed by a single global parameter, DM_MIN_CELL_SIZE.

Default: 50 — the strictest threshold of any major national statistical agency. Groups or strata with fewer than 50 observations are automatically suppressed from the checkpoint metadata.

Threshold	Use case	Agencies
50	Maximum safety — multi-agency compliance	All (strictest)
20	Standard microdata release	Statistics Sweden, Eurostat
10	Internal use within secure environment	US Census Bureau, UK ONS, Eurostat (tables)
5	Not recommended	Below most microdata standards

Change by setting the global before datamirror extract:

global DM_MIN_CELL_SIZE = 20
datamirror extract, replace
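To see what a given threshold would suppress before choosing it, a frequency table can be filtered by hand. This is a sketch of the general small-cell rule on the bundled auto data, not datamirror's implementation; the local min_cell is a hypothetical stand-in for DM_MIN_CELL_SIZE.

```stata
* Sketch: drop categories whose count falls below a minimum cell size.
sysuse auto, clear
local min_cell = 20                   // stand-in for DM_MIN_CELL_SIZE

contract rep78, freq(n)               // one row per category with its count
list, noobs

drop if n < `min_cell'                // suppress small cells
list, noobs
```

With a threshold of 20, only the rep78 == 3 cell (30 cars) remains.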

Limitations

Scope: cross-sectional microdata

datamirror v1 targets cross-sectional economic microdata, which represents the majority of Stata-based empirical workflows.

Design	Support	Notes
Cross-sectional RCT	✓ Full	Primary use case
Balance checks	✓ Full	Supported via standard OLS / FE checkpoints
OLS / FE regression	✓ Full	Supported
Treatment effects	✓ Full	With stratification for interactions
IV (distinct instruments)	✓ Full	Supported
IV (correlated instruments)	∼ Partial	Known limitation — methodological work in progress
Panel data (within-unit time-series)	✗	Requires fundamentally different dependence structure
Event studies / DiD	✗	Requires pre/post correlation within units

Panel data support requires a different generative architecture (temporal copulas, hierarchical models, state-space approaches) and is a separate methodological direction for future work.

Marginal drift on the outcome

Layer 4 either shifts the outcome y (linear family) or resamples it (GLM family) to pin the target coefficients. Predictor marginals are preserved intact; the outcome marginal drifts by O(1/√N) as a consequence.

Predictors (age, education, income, ...) are preserved.
Outcomes may drift slightly in mean or SD as a result of the coefficient-matching step.

Rationale: predictors are used across many analyses; outcomes usually belong to one model each, and pinning coefficients is the primary claim. If you report a summary-statistics table on the outcome variable, add a caveat noting that the mean and SD reflect the synthetic generation rather than the original.

Output files

After datamirror extract, the checkpoint_dir contains:

File	Contents
`metadata.csv`	Session info, version, timestamp
`schema.csv`	`varname, type, format, storage, is_integer`
`marginals_cont.csv`	101-point quantile distributions for continuous variables
`marginals_cat.csv`	Frequency tables for categorical variables
`correlations.csv`	Pairwise correlation matrix
`checkpoints.csv`	Checkpoint registry (`tag`, `notes`, `cmd`, `model`, …)
`checkpoints_coef.csv`	Target coefficients, long format with `cp_num` foreign key
`manifest.csv`	Self-describing listing of every file with row counts and descriptions

All files are plain CSV. They can be reviewed by a data owner before export and moved out of a secure environment through any standard file-transfer channel.

Author

Jeffrey Clark

PhD Student, Economics

Stockholm University
