# Fidelity tiers
An open vocabulary for what "faithful synthetic data" means. Four nested tiers with documented convergence targets. Any synthetic-data tool can be evaluated against this ladder.
## Why a fidelity vocabulary
Synthetic-data tools (synthpop, SDV, Gretel, Syntegra, Mostly AI, and others) all produce "synthetic data that matches the real data" — but what match means varies wildly. Distribution only? Correlations? Interaction effects? Regression results? Without a shared vocabulary, a paper that says "we validated on synthetic data" is claiming something unverifiable.
datamirror proposes a tiered fidelity vocabulary. Each tier has an operational definition, documented convergence targets, and an explicit scope. A researcher publishing a synthetic replication package can cite which tier their synthetic data meets; a journal can require a specific tier; a reviewer can verify the claim.
## Schema (metadata) — the floor
Before any statistical layer, a fidelity-faithful synthetic dataset must replicate the data's structure:
- Variable types (numeric, categorical, string)
- Value labels (code → label mappings)
- Variable labels
- Storage formats (e.g. integer preservation — datamirror auto-detects when the source data contains only integer values and rounds the synthetic output to match)
Without this floor, the synthetic data isn't usable with the original code — a `generate age = year - birthyear` line fails silently if `birthyear` was stored as a string in the original but becomes a float in the synthetic version.
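The integer-preservation rule from the storage-formats bullet can be sketched in a few lines. This is illustrative Python, not datamirror's Stata implementation, and the function name is made up for this example:

```python
def preserve_integer_storage(original, synthetic):
    """If every source value is integral, round the synthetic column to match.

    original, synthetic: sequences of floats (missing values already removed).
    """
    if all(float(v).is_integer() for v in original):
        return [round(v) for v in synthetic]
    return list(synthetic)

# A column stored as whole numbers stays whole-numbered after synthesis:
print(preserve_integer_storage([1.0, 4.0, 7.0], [2.3, 5.8]))  # [2, 6]
```

A column with any fractional source value passes through unchanged, so genuinely continuous variables are never coarsened.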
## Layer 1 — Marginal distributions
Each variable's one-dimensional distribution matches the original.
### Operational definition

- Continuous variables: match at all 101 quantile points (p0, p1, ..., p100); Kolmogorov–Smirnov test with `p > 0.05`.
- Categorical variables: match the frequency table exactly at all values observed in the original (post-SDC suppression — see privacy).
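The Layer-1 check can be sketched with no dependencies: compute the two-sample KS statistic (the maximum gap between the empirical CDFs; in practice a library routine such as scipy's `ks_2samp` supplies the p-value) and compare the 101 quantile points. The helper names below are hypothetical:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov–Smirnov statistic: max gap between the ECDFs.

    O(n*m) for clarity; real implementations sort and merge instead.
    """
    d = 0.0
    for v in sorted(set(a) | set(b)):
        fa = sum(x <= v for x in a) / len(a)
        fb = sum(x <= v for x in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

def quantile(xs, p):
    """Nearest-rank quantile of the data, p in [0, 100]."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

original = [x * 0.5 for x in range(200)]
synthetic = list(original)  # a perfect Layer-1 copy, for illustration

assert ks_statistic(original, synthetic) == 0.0
assert all(quantile(original, p) == quantile(synthetic, p) for p in range(101))
```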
### What this guarantees
Any analysis that only looks at one variable at a time (histograms, means, percentiles, summary statistics tables) produces statistically indistinguishable results on the synthetic data.
### What this does NOT guarantee
Any two-variable relationship — correlations, crosstabs, regressions. A Layer-1-only synthetic dataset can perfectly match each variable's distribution while completely destroying all joint structure.
## Layer 2 — Correlation structure
All pairwise numeric-variable correlations match the original.
### Operational definition

- Pearson correlation between any two numeric variables matches the original within `r > 0.95`.
- Implemented via a Gaussian copula: sample from a multivariate normal with the original correlation structure, then transform each dimension to match its Layer-1 marginal.
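The copula step above can be sketched in numpy: estimate the correlation matrix, draw a latent multivariate normal with that structure, then rank-transform each latent column onto the corresponding empirical marginal. Using the Pearson correlation of the raw data as the latent correlation is a simplification for this sketch; a production implementation would typically use normal-scores correlations:

```python
import numpy as np

def gaussian_copula_synthesize(data, n, seed=0):
    """data: (N, d) array of numeric columns. Returns an (n, d) synthetic sample."""
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(data, rowvar=False)                # target pairwise structure
    z = rng.multivariate_normal(np.zeros(data.shape[1]), corr, size=n)
    u = (z.argsort(axis=0).argsort(axis=0) + 0.5) / n     # latent ranks -> uniforms
    # Invert each column's empirical CDF to restore the Layer-1 marginal.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    )

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
data = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=5000)])
synth = gaussian_copula_synthesize(data, 5000)

# Pairwise correlation survives within sampling noise; marginals match closely.
assert abs(np.corrcoef(data, rowvar=False)[0, 1]
           - np.corrcoef(synth, rowvar=False)[0, 1]) < 0.05
```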
### What this guarantees
Two-variable relationships that are linear and numeric are preserved. Basic OLS on two continuous predictors yields coefficients close (but not exact) to the original. Standard exploratory data analysis — correlation heatmaps, scatter plots — produces the same picture.
### What this does NOT guarantee
Nonlinear relationships, interactions between three or more variables, within-subgroup patterns, or inferential-grade coefficient replication.
## Layer 3 — Stratification
All properties hold within strata.
### Operational definition

- Given a stratification variable (e.g. treatment group, wave, region), Layer 1 marginals and Layer 2 correlations are preserved within each stratum, not just in the aggregate.
- Implemented by fitting independent copulas per stratum and concatenating the samples.
- Compound stratification (e.g. treatment × wave) is supported via multi-variable strata.
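Per-stratum fitting is a thin wrapper around any single-stratum synthesizer: group the rows by the stratum key, synthesize each group independently, and concatenate. A minimal sketch, with a deliberately crude stand-in synthesizer (independent per-column resampling, i.e. Layer 1 only; datamirror fits a full copula per stratum):

```python
import random
from collections import defaultdict

def synthesize_stratified(rows, stratum_key, synthesize_group):
    """rows: list of dicts. Fit one model per stratum value, then concatenate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[stratum_key]].append(row)
    out = []
    for value, group in groups.items():
        for synth_row in synthesize_group(group):
            synth_row[stratum_key] = value   # re-attach the stratum label
            out.append(synth_row)
    return out

def bootstrap_columns(group):
    """Stand-in synthesizer: resample each column independently within a stratum."""
    keys = list(group[0])
    return [{k: random.choice([r[k] for r in group]) for k in keys}
            for _ in range(len(group))]

rows = [{"treat": 0, "y": 1}, {"treat": 0, "y": 2},
        {"treat": 1, "y": 10}, {"treat": 1, "y": 11}]
out = synthesize_stratified(rows, "treat", bootstrap_columns)

# Treated synthetic rows only ever draw from the treated stratum:
assert all(r["y"] in (10, 11) for r in out if r["treat"] == 1)
```

Because each stratum is fitted in isolation, within-stratum structure survives by construction, which is exactly why subgroup analyses work at this layer.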
### What this guarantees
Subgroup analyses work. Treatment-effect estimates on a Layer-3 synthetic dataset produce coefficients close to the original for the treated vs. control groups separately. Interaction effects (treatment × moderator) are preserved when the moderator is part of the stratification.
### What this does NOT guarantee
Exact replication of regression coefficients to inferential precision. Layers 1–3 together reproduce the shape of the data; the specific β values from any particular regression drift within a small but detectable range.
## Layer 4 — Checkpoint constraints ⭐
The novel contribution. Regression coefficients from a declared set of "checkpoint" models match the original within a documented tolerance. Two principled methods, one for each family of estimator, with no learning rates or iteration knobs in either path.
### Operational definition

- The researcher runs their key regressions on the original data inside the secure environment and calls `datamirror checkpoint, tag(...)` after each one.
- Layer 4 then produces an outcome vector `y` such that re-running the exact same regression on the synthetic data recovers the checkpointed coefficients.
- Linear estimators (OLS, fixed effects, 2SLS) use a closed-form Newton step. Because β̂ is a linear functional of y, shifting y by `X · (β* − β̂)` updates β̂ by exactly `β* − β̂`. One step, exact up to floating-point precision.
- Shared-outcome IV groups (multiple 2SLS specs on the same y) use a joint stacked Newton step that pins every coefficient constraint simultaneously.
- Generalized linear models (logit, probit, poisson, nbreg) draw y fresh from the canonical data-generating process at the target coefficients. Fitting the matching GLM on the resulting y recovers β* within `O(1/√N)` sampling noise.
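The one-step exactness claim for the linear family is easy to verify numerically: because β̂ = (XᵀX)⁻¹Xᵀy is linear in y, adding X(β* − β̂) to y moves the OLS fit to β* exactly. A numpy sketch of the plain-OLS case (the fixed-effects and 2SLS variants add the weighting described above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # fit on the "original" outcome
beta_star = np.array([1.1, -1.9, 0.4])            # checkpointed target coefficients

# Closed-form Newton step: shift the outcome so OLS lands exactly on beta_star.
y_synth = y + X @ (beta_star - beta_hat)
beta_new = np.linalg.lstsq(X, y_synth, rcond=None)[0]

assert np.allclose(beta_new, beta_star)           # exact up to floating point
```

No learning rate and no iteration count appear anywhere: the update is a single linear-algebra identity, which is what "closed-form Newton step" means here.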
### Fidelity target by model family

The convergence target is Δβ/SE < 3, where Δβ is the absolute difference between the synthetic and target coefficient and SE is the synthetic regression's own standard error. This corresponds to a Bonferroni-corrected 99% simultaneous confidence region over the 3–5 coefficients per checkpoint that the harness inspects. Validated on four AEA replication packages: 390+ coefficient comparisons, none exceeding the threshold.
| Family | Commands | Method |
|---|---|---|
| Linear | regress, reghdfe | Closed-form Newton via matrix score |
| IV | ivregress 2sls (single + joint) | Weighted-FWL Newton; joint stack for shared outcomes |
| Binary | logit, probit | Direct Bernoulli DGP |
| Count | poisson, nbreg | Direct Poisson / Gamma-Poisson DGP |
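The GLM path can be illustrated for the Poisson family: draw the outcome from the canonical DGP at the target coefficients, refit the model (here via a hand-rolled Newton-Raphson; the real workflow refits Stata's `poisson`), and check the Δβ/SE target. Illustrative numpy on a made-up toy design:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_star = np.array([0.2, 0.5])                  # checkpointed target

# Draw the outcome fresh from the canonical Poisson DGP at beta_star.
y = rng.poisson(np.exp(X @ beta_star))

# Refit by Newton-Raphson on the Poisson log-likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))

# Delta-beta / SE, with SEs from the inverse Fisher information at the fit.
se = np.sqrt(np.diag(np.linalg.inv(X.T @ (np.exp(X @ beta)[:, None] * X))))
ratio = np.abs(beta - beta_star) / se             # documented target: < 3
```

With N = 20,000 the refit lands within `O(1/√N)` of β*, so on any given draw the Δβ/SE target is met with high probability; unlike the linear path, the match is statistical rather than exact.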
### Trade-off: outcome marginal drifts

For the linear family the Newton step adds a small vector to y; for the GLM family y is resampled from the target model. Predictor marginals (age, education, income, ...) are preserved in both cases. Outcome marginals may drift by O(1/√N) from the original — this is by design: predictors are reused across many analyses, while outcomes usually belong to one model each, and pinning the coefficients is the primary claim.
## vs. other synthetic-data tools
Other synthetic-data generators stop at Layer 1, 2, or 3. datamirror's Layer 4 adds an explicit coefficient-preservation constraint, aimed at making synthetic data usable for inferential replication in register research.
| Tool | Language | Typical tier | Note |
|---|---|---|---|
| synthpop | R | Layer 1–2 | CART-based synthesis; preserves marginals + bivariate structure well |
| SDV (Gaussian copula) | Python | Layer 2 | Correlation-preserving; no stratification by default |
| SDV (CTGAN) | Python | Layer 1–2 | Deep-learning; strong marginals, weaker joint structure |
| Gretel / Mostly AI | Commercial | Layer 1–3 | Stratification via conditional generation |
| datamirror | Stata (v1) | Layer 4 | Explicit coefficient-preservation constraint via checkpointed estimation |
Characterizations are approximate and reflect the strongest publicly documented claim for each tool. A careful comparison paper is forthcoming.
## How to cite a tier

A paper or replication package using synthetic data can state which tier its synthesis meets, for example in the data-availability statement; a methods note can use a shorter form. `datamirror check` prints the achieved Δβ/SE per checkpoint, so there is a concrete number to cite.
## See also
- datamirror overview
- Stata reference — commands and workflow
- Privacy & SDC — how checkpoint output is kept non-disclosive