# Fidelity tiers
An open vocabulary for what "faithful synthetic data" means. Four nested tiers with documented convergence targets. Any synthetic-data tool can be evaluated against this ladder.
## Why a fidelity vocabulary
Synthetic-data tools (synthpop, SDV, Gretel, Syntegra, Mostly AI, and others) all produce "synthetic data that matches the real data" — but what match means varies wildly. Distribution only? Correlations? Interaction effects? Regression results? Without a shared vocabulary, a paper that says "we validated on synthetic data" is claiming something unverifiable.
datamirror proposes a tiered fidelity vocabulary. Each tier has an operational definition, documented convergence targets, and an explicit scope. A researcher publishing a synthetic replication package can cite which tier their synthetic data meets; a journal can require a specific tier; a reviewer can verify the claim.
## Schema (metadata) — the floor
Before any statistical layer, a fidelity-faithful synthetic dataset must replicate the data's structure:
- Variable types (numeric, categorical, string)
- Value labels (code → label mappings)
- Variable labels
- Storage formats (e.g. integer preservation — datamirror auto-detects when the source data contains only integer values and rounds the synthetic output to match)
Without this floor, the synthetic data isn't usable with the original code — a `generate age = year - birthyear` line fails silently if `birthyear` was stored as a string in the original but becomes a float in the synthetic version.
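The integer-preservation rule from the storage-formats bullet can be sketched in a few lines. This is illustrative Python, not datamirror's Stata implementation, and the function name is made up for this example:

```python
def preserve_integer_storage(original, synthetic):
    """If every source value is integral, round the synthetic column to match.

    original, synthetic: sequences of floats (missing values already removed).
    """
    if all(float(v).is_integer() for v in original):
        return [round(v) for v in synthetic]
    return list(synthetic)

# A column stored as whole numbers stays whole-numbered after synthesis:
print(preserve_integer_storage([1.0, 4.0, 7.0], [2.3, 5.8]))  # [2, 6]
```

A column with any fractional source value passes through unchanged, so genuinely continuous variables are never coarsened.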
## Layer 1 — Marginal distributions
Each variable's one-dimensional distribution matches the original.
### Operational definition

- Continuous variables: match at all 101 quantile points (p0, p1, ..., p100); Kolmogorov–Smirnov test with `p > 0.05`.
- Categorical variables: match the frequency table exactly at all values observed in the original (post-SDC suppression — see privacy).
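The Layer-1 check can be sketched with no dependencies: compute the two-sample KS statistic (the maximum gap between the empirical CDFs; in practice a library routine such as scipy's `ks_2samp` supplies the p-value) and compare the 101 quantile points. The helper names below are hypothetical:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov–Smirnov statistic: max gap between the ECDFs.

    O(n*m) for clarity; real implementations sort and merge instead.
    """
    d = 0.0
    for v in sorted(set(a) | set(b)):
        fa = sum(x <= v for x in a) / len(a)
        fb = sum(x <= v for x in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

def quantile(xs, p):
    """Nearest-rank quantile of the data, p in [0, 100]."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

original = [x * 0.5 for x in range(200)]
synthetic = list(original)  # a perfect Layer-1 copy, for illustration

assert ks_statistic(original, synthetic) == 0.0
assert all(quantile(original, p) == quantile(synthetic, p) for p in range(101))
```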
### What this guarantees
Any analysis that only looks at one variable at a time (histograms, means, percentiles, summary statistics tables) produces statistically indistinguishable results on the synthetic data.
### What this does NOT guarantee
Any two-variable relationship — correlations, crosstabs, regressions. A Layer-1-only synthetic dataset can perfectly match each variable's distribution while completely destroying all joint structure.
## Layer 2 — Correlation structure
All pairwise numeric-variable correlations match the original.
### Operational definition

- Pearson correlation between any two numeric variables matches the original within `r > 0.95`.
- Implemented via a Gaussian copula: sample from a multivariate normal with the original correlation structure, then transform each dimension to match its Layer-1 marginal.
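The copula step above can be sketched in numpy: estimate the correlation matrix, draw a latent multivariate normal with that structure, then rank-transform each latent column onto the corresponding empirical marginal. Using the Pearson correlation of the raw data as the latent correlation is a simplification for this sketch; a production implementation would typically use normal-scores correlations:

```python
import numpy as np

def gaussian_copula_synthesize(data, n, seed=0):
    """data: (N, d) array of numeric columns. Returns an (n, d) synthetic sample."""
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(data, rowvar=False)                # target pairwise structure
    z = rng.multivariate_normal(np.zeros(data.shape[1]), corr, size=n)
    u = (z.argsort(axis=0).argsort(axis=0) + 0.5) / n     # latent ranks -> uniforms
    # Invert each column's empirical CDF to restore the Layer-1 marginal.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    )

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
data = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=5000)])
synth = gaussian_copula_synthesize(data, 5000)

# Pairwise correlation survives within sampling noise; marginals match closely.
assert abs(np.corrcoef(data, rowvar=False)[0, 1]
           - np.corrcoef(synth, rowvar=False)[0, 1]) < 0.05
```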
### What this guarantees
Two-variable relationships that are linear and numeric are preserved. Basic OLS on two continuous predictors yields coefficients close (but not exact) to the original. Standard exploratory data analysis — correlation heatmaps, scatter plots — produces the same picture.
### What this does NOT guarantee
Nonlinear relationships, interactions between three or more variables, within-subgroup patterns, or inferential-grade coefficient replication.
## Layer 3 — Stratification
All properties hold within strata.
### Operational definition

- Given a stratification variable (e.g. treatment group, wave, region), Layer 1 marginals and Layer 2 correlations are preserved within each stratum, not just in the aggregate.
- Implemented by fitting independent copulas per stratum and concatenating the samples.
- Compound stratification (e.g. treatment × wave) is supported via multi-variable strata.
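Per-stratum fitting is a thin wrapper around any single-stratum synthesizer: group the rows by the stratum key, synthesize each group independently, and concatenate. A minimal sketch, with a deliberately crude stand-in synthesizer (independent per-column resampling, i.e. Layer 1 only; datamirror fits a full copula per stratum):

```python
import random
from collections import defaultdict

def synthesize_stratified(rows, stratum_key, synthesize_group):
    """rows: list of dicts. Fit one model per stratum value, then concatenate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[stratum_key]].append(row)
    out = []
    for value, group in groups.items():
        for synth_row in synthesize_group(group):
            synth_row[stratum_key] = value   # re-attach the stratum label
            out.append(synth_row)
    return out

def bootstrap_columns(group):
    """Stand-in synthesizer: resample each column independently within a stratum."""
    keys = list(group[0])
    return [{k: random.choice([r[k] for r in group]) for k in keys}
            for _ in range(len(group))]

rows = [{"treat": 0, "y": 1}, {"treat": 0, "y": 2},
        {"treat": 1, "y": 10}, {"treat": 1, "y": 11}]
out = synthesize_stratified(rows, "treat", bootstrap_columns)

# Treated synthetic rows only ever draw from the treated stratum:
assert all(r["y"] in (10, 11) for r in out if r["treat"] == 1)
```

Because each stratum is fitted in isolation, within-stratum structure survives by construction, which is exactly why subgroup analyses work at this layer.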
### What this guarantees
Subgroup analyses work. Treatment-effect estimates on a Layer-3 synthetic dataset produce coefficients close to the original for the treated vs. control groups separately. Interaction effects (treatment × moderator) are preserved when the moderator is part of the stratification.
### What this does NOT guarantee
Exact replication of regression coefficients to inferential precision. Layers 1–3 together reproduce the shape of the data; the specific β values from any particular regression drift within a small but detectable range.
## Layer 4 — Checkpoint constraints ⭐
The novel contribution. Regression coefficients from a declared set of "checkpoint" models match the original within a documented tolerance. Two principled methods, one for each family of estimator, with no learning rates or iteration knobs in either path.
### Operational definition

- The researcher runs their key regressions on the original data inside the secure environment and calls `datamirror checkpoint, tag(...)` after each one.
- Layer 4 then produces an outcome vector `y` such that re-running the exact same regression on the synthetic data recovers the checkpointed coefficients.
- Linear estimators (OLS, fixed effects, 2SLS) use a closed-form Newton step. Because β̂ is a linear functional of y, shifting y by `X · (β* − β̂)` updates β̂ by exactly `β* − β̂`. One step, exact up to floating-point precision.
- Shared-outcome IV groups (multiple 2SLS specs on the same y) use a joint stacked Newton step that pins every coefficient constraint simultaneously.
- Generalized linear models (logit, probit, poisson, nbreg) draw y fresh from the canonical data-generating process at the target coefficients. Fitting the matching GLM on the resulting y recovers β* within `O(1/√N)` sampling noise.
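The one-step exactness claim for the linear family is easy to verify numerically: because β̂ = (XᵀX)⁻¹Xᵀy is linear in y, adding X(β* − β̂) to y moves the OLS fit to β* exactly. A numpy sketch of the plain-OLS case (the fixed-effects and 2SLS variants add the weighting described above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # fit on the "original" outcome
beta_star = np.array([1.1, -1.9, 0.4])            # checkpointed target coefficients

# Closed-form Newton step: shift the outcome so OLS lands exactly on beta_star.
y_synth = y + X @ (beta_star - beta_hat)
beta_new = np.linalg.lstsq(X, y_synth, rcond=None)[0]

assert np.allclose(beta_new, beta_star)           # exact up to floating point
```

No learning rate and no iteration count appear anywhere: the update is a single linear-algebra identity, which is what "closed-form Newton step" means here.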
### Fidelity target by model family

The convergence target is Δβ/SE < 3, where Δβ is the absolute difference between the synthetic and target coefficient and SE is the synthetic regression's own standard error. This corresponds to a Bonferroni-corrected 99% simultaneous confidence region over the 3–5 coefficients per checkpoint that the harness inspects. Validated on four AEA replication packages: 390+ coefficient comparisons, none exceeding the threshold.
| Family | Commands | Method |
|---|---|---|
| Linear | regress, reghdfe | Closed-form Newton via matrix score |
| IV | ivregress 2sls (single + joint) | Weighted-FWL Newton; joint stack for shared outcomes |
| Binary | logit, probit | Direct Bernoulli DGP |
| Count | poisson, nbreg | Direct Poisson / Gamma-Poisson DGP |
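The GLM path can be illustrated for the Poisson family: draw the outcome from the canonical DGP at the target coefficients, refit the model (here via a hand-rolled Newton-Raphson; the real workflow refits Stata's `poisson`), and check the Δβ/SE target. Illustrative numpy on a made-up toy design:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_star = np.array([0.2, 0.5])                  # checkpointed target

# Draw the outcome fresh from the canonical Poisson DGP at beta_star.
y = rng.poisson(np.exp(X @ beta_star))

# Refit by Newton-Raphson on the Poisson log-likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))

# Delta-beta / SE, with SEs from the inverse Fisher information at the fit.
se = np.sqrt(np.diag(np.linalg.inv(X.T @ (np.exp(X @ beta)[:, None] * X))))
ratio = np.abs(beta - beta_star) / se             # documented target: < 3
```

With N = 20,000 the refit lands within `O(1/√N)` of β*, so on any given draw the Δβ/SE target is met with high probability; unlike the linear path, the match is statistical rather than exact.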
### Trade-off: outcome marginal drifts

For the linear family the Newton step adds a small vector to y; for the GLM family y is resampled from the target model. Predictor marginals (age, education, income, ...) are preserved in both cases. Outcome marginals may drift by O(1/√N) from the original — this is by design: predictors are reused across many analyses, while outcomes usually belong to one model each, and pinning the coefficients is the primary claim.
## vs. other synthetic-data tools
Other synthetic-data generators stop at Layer 1, 2, or 3. datamirror's Layer 4 adds an explicit coefficient-preservation constraint, aimed at making synthetic data usable for inferential replication in register research.
| Tool | Language | Typical tier | Note |
|---|---|---|---|
| synthpop | R | Layer 1–2 | CART-based synthesis; preserves marginals + bivariate structure well |
| SDV (Gaussian copula) | Python | Layer 2 | Correlation-preserving; no stratification by default |
| SDV (CTGAN) | Python | Layer 1–2 | Deep-learning; strong marginals, weaker joint structure |
| Gretel / Mostly AI | Commercial | Layer 1–3 | Stratification via conditional generation |
| datamirror | Stata (v1) | Layer 4 | Explicit coefficient-preservation constraint via checkpointed estimation |
Characterizations are approximate and reflect the strongest publicly documented claim for each tool. A careful comparison paper is forthcoming.
## How to cite a tier

A paper or replication package using synthetic data can state which tier its synthesis meets, for example in the data-availability statement; a methods note can use a shorter form. `datamirror check` prints the achieved Δβ/SE per checkpoint, so there is a concrete number to cite.
## See also
- datamirror overview
- Stata reference — commands and workflow
- Privacy & SDC — how checkpoint output is kept non-disclosive