Module
datamirror
Coefficient-faithful synthetic data for register research. Develop outside the secure environment, validate inside. Four layers: marginals, correlations, stratification, and a coefficient-preservation layer that pins the regressions you care about.
What it does
Most synthetic-data tools stop at matching univariate and bivariate distributions. That is not enough for replication: the same regression run on such synthetic data drifts away from the published coefficients. datamirror adds a fourth layer whose only job is to make the checkpointed regressions return the same estimates.
* Inside the secure environment — checkpoint your real analysis
use confidential_data.dta, clear
datamirror init, checkpoint_dir("output") replace
reg wellbeing age i.education female
datamirror checkpoint, tag("wellbeing_model")
datamirror extract
* Outside — rebuild from the on-disk checkpoint directory
datamirror rebuild using "output", clear seed(12345)
* Re-run your exact analysis on synthetic data
reg wellbeing age i.education female
* → coefficients recover the target within Δβ/SE < 3
The four layers
- Marginals — 101-point quantile distributions for continuous variables; frequency tables for categorical.
- Correlations — Gaussian copula captures pairwise dependencies.
- Stratification — optional: marginals and correlations held within strata (e.g. by wave, region, treatment).
- Coefficient preservation — the novel contribution. Linear estimators (OLS, fixed effects, 2SLS, joint 2SLS on shared outcomes) use a closed-form Newton step to pin the target coefficients exactly. Generalized linear models (logit, probit, poisson, negative binomial) draw the outcome from the canonical data-generating process at the target coefficients. Both arrive at synthetic data whose regressions match the original within
Δβ/SE < 3 across a Bonferroni-corrected 99% simultaneous confidence region.
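The two coefficient-preservation strategies can be illustrated with built-in Stata commands alone. The sketch below is not datamirror's implementation: the simulated data, the target values (b0_t, b1_t, b2_t), and all variable names are hypothetical. For the linear case it assumes the simplest closed-form adjustment, shifting the outcome by X(β_target − β̂), which makes OLS return the targets up to floating-point error. For the logit case it draws the outcome from the model at the target coefficients, so the estimates match only up to sampling noise.

```stata
* Illustrative sketch only; not datamirror's implementation.
clear
set seed 12345
set obs 1000

* Stand-in "synthetic" covariates and outcome (simulated, hypothetical)
generate double x1 = rnormal()
generate double x2 = rnormal()
generate double y  = 1 + 0.5*x1 - 0.3*x2 + rnormal()

* Hypothetical target coefficients (in practice these would come from a checkpoint)
scalar b0_t =  2.0
scalar b1_t =  0.8
scalar b2_t = -0.1

* Linear case: OLS is linear in y, so adding X*(target - current) to y
* moves the estimated coefficients exactly onto the targets.
regress y x1 x2
generate double y_adj = y + (b0_t - _b[_cons]) + (b1_t - _b[x1])*x1 + (b2_t - _b[x2])*x2
regress y_adj x1 x2
assert reldif(_b[x1], b1_t)    < 1e-8
assert reldif(_b[x2], b2_t)    < 1e-8
assert reldif(_b[_cons], b0_t) < 1e-8

* GLM case: draw a binary outcome from the logit model at the targets.
* The re-estimated coefficients are close to the targets, not equal to them.
generate double xb = b0_t + b1_t*x1 + b2_t*x2
generate byte y_bin = runiform() < invlogit(xb)
logit y_bin x1 x2
```

In the linear case the shift leaves the residuals unchanged, so the standard errors are those of the stand-in data rather than those of the original regression. A count outcome could be drawn the same way with rpoisson(exp(xb)).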
Validated on four published AEA replication packages (Duflo-Hanna-Ryan 2012, Dupas-Robinson 2013, Banerjee et al. 2015, Autor-Dorn-Hanson 2019): 390+ coefficient comparisons, none exceed the threshold.
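A Δβ/SE check of this kind can be sketched with built-in Stata commands. The sketch assumes Δβ/SE means the absolute difference between the synthetic and original coefficients divided by the original standard error, which the text does not spell out. Stata's shipped auto dataset stands in for the confidential data and a bootstrap resample stands in for the synthetic data; neither reflects how datamirror generates data.

```stata
* Illustrative sketch only; the datasets here are stand-ins.
sysuse auto, clear

* "Original" regression: keep coefficients and variance matrix
regress price mpg weight
matrix b_orig = e(b)
matrix V_orig = e(V)

* Stand-in for synthetic data: a bootstrap resample of the same rows
set seed 12345
bsample
regress price mpg weight
matrix b_syn = e(b)

* Compare each coefficient: |b_syn - b_orig| / SE_orig against the threshold of 3
local names : colnames b_orig
local k = colsof(b_orig)
forvalues j = 1/`k' {
    local nm : word `j' of `names'
    local ratio = abs(b_syn[1,`j'] - b_orig[1,`j']) / sqrt(V_orig[`j',`j'])
    local flag = cond(`ratio' < 3, "ok", "exceeds")
    display "`nm'" _col(14) %8.3f `ratio' _col(26) "`flag'"
}
```

Locals are used for the ratio rather than a short scalar name because Stata resolves names in expressions to variables (including abbreviations) before scalars.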
Platforms
Live · v1.0
Stata
Shipped: the four-layer architecture with coefficient preservation for linear (OLS, FE, 2SLS, joint 2SLS), binary (logit, probit), and count (Poisson, negative binomial) estimators.
Stata docs →
Roadmap
Python + R
Cross-platform ports after the Stata methods paper is submitted.
On the roadmap →