Privacy & Statistical Disclosure Control
datamirror implements Statistical Disclosure Control (SDC) so that checkpoint metadata cannot be used to identify small groups or individuals. Three controls govern the output — cell-size suppression, continuous tail-coding, and a capture-time disclosure check on each regression checkpoint — with defaults aligned to the strictest national agency thresholds.
TL;DR
- Three controls, each resolved the same way (init option > registream config > package default):
  - `min_cell_size` (default 50): suppress small categorical cells and small strata.
  - `quantile_trim` (default 1): top/bottom-code continuous extremes.
  - `min_resid_df` (default 10): minimum residual degrees of freedom for a regression checkpoint.
- The default cell size of 50 is stricter than any major national statistical agency minimum.
- Groups / strata / categories with fewer than `min_cell_size` observations are automatically suppressed from checkpoint output.
- Each regression checkpoint passes a capture-time disclosure check: reject if sample N is below `min_cell_size`, reject if residual df is below `min_resid_df`, and drop any coefficient describing a sub-`min_cell_size` group while keeping the rest.
- Continuous quantiles (after tail-coding) and aggregate Spearman correlation matrices are non-disclosive and don't need per-cell suppression.
This page is the reference. For the full annotated version, see datamirror/docs/PRIVACY.md in the repo.
The privacy parameters
datamirror has three privacy controls. All three resolve the same way at the start of every datamirror init call; the nearest override wins (session option > persistent config > package default):
| Parameter | Default | Governs |
|---|---|---|
| `min_cell_size` | 50 | Suppression of small categorical cells and small strata. |
| `quantile_trim` | 1 | Top/bottom-coding of continuous extremes (see continuous variables). |
| `min_resid_df` | 10 | Minimum residual degrees of freedom for a regression checkpoint (see checkpoint coefficients). |
The cell-size control is shown below; the other two are documented in their own sections. The resolution pattern is identical for all three:
- Per session via the init option: `datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)`
- Persistent across sessions via the registream config: `registream config, dm_min_cell_size(20)`. Written to `~/.registream/config_stata.csv`. View current value with `registream info`.
- Package default (fallback): `$DM_MIN_CELL_SIZE = 50` at the top of `_dm_utils.ado`. Used when neither of the above is set. The `quantile_trim` and `min_resid_df` defaults live in the same file.
Whichever values win the resolution are recorded in metadata.csv at extract time, so any reviewer inspecting the checkpoint bundle can verify the policy applied without re-running.
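The precedence rule can be illustrated with a short Python sketch. It is not datamirror's code: `PACKAGE_DEFAULTS` and `resolve_param` are hypothetical names, the parameter keys are simplified (the real config keys carry a `dm_` prefix), and only the default values come from the table above.

```python
# Illustrative sketch of the precedence rule; not datamirror's implementation.
# PACKAGE_DEFAULTS and resolve_param are hypothetical names.
PACKAGE_DEFAULTS = {"min_cell_size": 50, "quantile_trim": 1, "min_resid_df": 10}


def resolve_param(name, session_options=None, persistent_config=None):
    """Return the winning value: session option > persistent config > default."""
    for source in (session_options or {}, persistent_config or {}):
        # Compare against None so that an explicit 0 still counts as set.
        if source.get(name) is not None:
            return source[name]
    return PACKAGE_DEFAULTS[name]


# Made-up overrides: the session sets one parameter, the config sets two.
session = {"min_cell_size": 20}
config = {"min_cell_size": 30, "quantile_trim": 5}
resolved = {p: resolve_param(p, session, config) for p in PACKAGE_DEFAULTS}
print(resolved)
```

Here `min_cell_size` comes from the session, `quantile_trim` from the persistent config, and `min_resid_df` falls back to the package default.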
Thresholds by agency
| Threshold | Safety level | Use case | Agencies |
|---|---|---|---|
| 50 | Maximum | Public data release, multi-agency compliance | All (strictest) |
| 20 | Standard | Synthetic microdata for research | Statistics Sweden (SCB), Eurostat microdata |
| 10 | Minimum | Internal use within a secure environment | US Census Bureau, UK ONS, Eurostat tables |
| 5 | Insufficient | Not recommended | Below most microdata standards |
Alignment per agency (all verified against public guidance):
| Agency | Tables | Microdata | Notes |
|---|---|---|---|
| US Census Bureau | 10 | 10 | Standard for public-use microdata |
| UK ONS | 10 | 10 | Standard for all public releases |
| Statistics Sweden (SCB) | 5 | 20 | Stricter for microdata |
| Eurostat | 5 | 20 | Stricter for microdata |
| Statistics Canada | 5 | 10 | Context-dependent |
The default of 50 is deliberately above all of these, so checkpoint output is acceptable to any agency at its strictest policy. Agencies comfortable with a lower threshold can relax it through the init option or the registream config.
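As a quick cross-check of the agency table, the sketch below lists which agencies a given threshold would satisfy. The numbers are copied from the table; `AGENCY_MINIMUMS` and `agencies_satisfied` are hypothetical names used only for illustration.

```python
# Minimum cell sizes copied from the agency table above: (tables, microdata).
AGENCY_MINIMUMS = {
    "US Census Bureau": (10, 10),
    "UK ONS": (10, 10),
    "Statistics Sweden (SCB)": (5, 20),
    "Eurostat": (5, 20),
    "Statistics Canada": (5, 10),
}


def agencies_satisfied(threshold):
    """List agencies whose table and microdata minimums are both met."""
    return [
        agency
        for agency, (tables, microdata) in AGENCY_MINIMUMS.items()
        if threshold >= max(tables, microdata)
    ]


print(agencies_satisfied(50))  # the default threshold
print(agencies_satisfied(10))  # a relaxed threshold
```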
What the threshold protects
Three areas of the checkpoint output apply the DM_MIN_CELL_SIZE test at write-time:
1. Overall categorical marginals
File: <checkpoint_dir>/marginals_cat.csv
Categories with fewer than DM_MIN_CELL_SIZE observations are suppressed (not written). Example with default threshold 50:
- Category `rare_disease=Yes` with n=35 → suppressed
- Category `rare_disease=No` with n=7,465,964 → included
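The rule amounts to a filter on category counts. A minimal Python sketch, assuming the counts are already tabulated (`suppress_small_cells` is a hypothetical helper, not part of datamirror):

```python
def suppress_small_cells(counts, min_cell_size=50):
    """Split category counts into published cells and suppressed labels."""
    published = {cat: n for cat, n in counts.items() if n >= min_cell_size}
    suppressed = [cat for cat, n in counts.items() if n < min_cell_size]
    return published, suppressed


# The two categories from the example above.
counts = {"rare_disease=Yes": 35, "rare_disease=No": 7_465_964}
published, suppressed = suppress_small_cells(counts)
print(published)
print(suppressed)
```

A cell with exactly `min_cell_size` observations is kept, since only cells with fewer observations are suppressed.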
2. Stratified categorical marginals
File: <checkpoint_dir>/marginals_cat_stratified.csv
Within each stratum, categories with fewer than DM_MIN_CELL_SIZE observations are suppressed. Example with stratification by region:
- `region=A`, `occupation=rare_job`, n=15 → suppressed
- `region=A`, `occupation=teacher`, n=45,230 → included
3. Stratified correlations
File: <checkpoint_dir>/correlations_stratified.csv
Entire strata with fewer than DM_MIN_CELL_SIZE observations are skipped. Example:
- `rural_island=1` with n=12 → stratum skipped entirely
- `urban_city=1` with n=2,456,123 → correlations computed
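The two stratified rules (suppress small cells within a stratum, skip whole strata that are too small) can be sketched together. This is an illustration with toy counts, not datamirror's code; `stratified_suppression` and the row layout are assumptions.

```python
from collections import Counter


def stratified_suppression(rows, min_cell_size=50):
    """rows is a list of (stratum, category) pairs, one per observation.

    Returns the strata large enough to be processed and the
    (stratum, category) cells large enough to be published.
    """
    stratum_sizes = Counter(stratum for stratum, _ in rows)
    cell_sizes = Counter(rows)
    kept_strata = sorted(s for s, n in stratum_sizes.items() if n >= min_cell_size)
    kept_cells = {cell: n for cell, n in cell_sizes.items() if n >= min_cell_size}
    return kept_strata, kept_cells


# Toy data: stratum A has 75 rows, stratum B only 12.
rows = [("A", "teacher")] * 60 + [("A", "rare_job")] * 15 + [("B", "teacher")] * 12
kept_strata, kept_cells = stratified_suppression(rows)
print(kept_strata, kept_cells)
```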
Continuous variables — tail-coding the extremes
Continuous-variable marginals are stored as a 101-point quantile distribution in marginals_cont.csv. The interior quantiles (p1 through p99) are population-level statistics with no disclosure risk: the 1st percentile of a variable with millions of observations is not a single individual's value. The raw extremes (q0 = minimum, q100 = maximum) are a different matter — the oldest person in a region or the highest earner in a parish is a single-observation statistic, and a recipient reads it straight off min()/max() of the synthetic variable.
The quantile_trim control closes this channel by top/bottom-coding those extremes. At the default of 1, q0 is populated with the 1st percentile and q100 with the 99th percentile, so the synthetic support contracts to [p01, p99] without distorting the interior distribution. It resolves the same way as the other parameters:
- Per session: `datamirror init, ... quantile_trim(5)`
- Persistent: `registream config, dm_quantile_trim(5)`
- Package default: `1` in `_dm_utils.ado`.
Set it to 0 only when the source data has already been tail-coded upstream by the data custodian; larger values (e.g. 5 for [p05, p95]) give stronger SDC at the cost of a more contracted support. Under stratification, the same trim applies within each stratum and strata below min_cell_size are skipped from marginals_cont_stratified.csv. The resolved value is recorded in metadata.csv as dm_quantile_trim.
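A minimal Python sketch of tail-coding on a 101-point quantile grid follows. It is not datamirror's implementation: `quantile_grid` and `tail_code` are hypothetical names, and for trims above 1 it assumes that every grid point outside the trimmed range is clamped to the bound, which is one way to obtain the [p05, p95] support described above.

```python
import statistics


def quantile_grid(values):
    """101-point grid: q0 = min, q1..q99 = percentiles, q100 = max."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return [min(values)] + cuts + [max(values)]


def tail_code(grid, quantile_trim=1):
    """Clamp grid points to the range [q_trim, q_(100 - trim)].

    With quantile_trim=0 the bounds are q0 and q100, so nothing changes.
    """
    low, high = grid[quantile_trim], grid[100 - quantile_trim]
    return [min(max(q, low), high) for q in grid]


values = list(range(101))  # toy data: 0, 1, ..., 100
grid = quantile_grid(values)
coded = tail_code(grid, quantile_trim=1)
print(coded[0], coded[50], coded[100])
```

With the default trim of 1, only q0 and q100 change; the interior points q1 through q99 are left as computed.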
Correlation matrices — disclosure-risk assessment
Correlation matrices are low-risk, and releasing them is standard practice for statistical agencies. datamirror exports a full pairwise Spearman rank correlation matrix (correlations.csv) without per-cell suppression. Spearman is the correct input for the Gaussian copula (it is invariant to the monotone marginal transforms), and like Pearson it is an aggregate over the full sample with the same low disclosure risk.
Aggregate statistics, not individual data
A rank correlation is a population-level summary, computed over every observation:
$$\mathrm{Correlation}(X, Y) = \frac{\sum_{i=1}^{n} (r_i - \bar{r})(s_i - \bar{s})}{n \, \sigma_r \, \sigma_s}$$

where $r_i$, $s_i$ are the ranks of $x_i$, $y_i$.

- Each correlation averages over all n observations.
- No individual's data can be recovered from a correlation coefficient.
- Standard output from regression analyses worldwide.
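The rank formula above can be checked with a small pure-Python sketch. It assumes that ties receive the average of their ranks and that σr, σs are population standard deviations (divisor n), which is what the n in the denominator implies. The helper names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed from ranks, term by term as in the formula."""
    r, s = average_ranks(x), average_ranks(y)
    n = len(r)
    r_bar, s_bar = sum(r) / n, sum(s) / n
    sigma_r = math.sqrt(sum((ri - r_bar) ** 2 for ri in r) / n)
    sigma_s = math.sqrt(sum((si - s_bar) ** 2 for si in s) / n)
    cov_sum = sum((ri - r_bar) * (si - s_bar) for ri, si in zip(r, s))
    return cov_sum / (n * sigma_r * sigma_s)


x = [3.0, 1.0, 4.0, 1.5, 9.0]
y_monotone = [xi ** 3 for xi in x]  # a monotone transform of x
y_reversed = [-xi for xi in x]      # reverses the ordering
print(spearman(x, y_monotone), spearman(x, y_reversed))
```

Because only ranks enter the calculation, the monotone transform leaves the coefficient unchanged, which is the invariance property that makes Spearman suitable as copula input.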
Small groups have negligible influence
With a large n (e.g. n=7.5M), a small group's influence on the overall correlation is tiny:
| Small group size | Influence on overall correlation |
|---|---|
| n=1 | ~0.0000001 |
| n=10 | ~0.000001 |
| n=49 | ~0.000007 |
| n=100 | ~0.00001 |
Even a perfectly correlated small subgroup (r=1.0 internally) contributes orders of magnitude below the noise floor of a full-sample correlation. Correlation matrices are reliably non-disclosive at realistic register-data sample sizes.
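The orders of magnitude in the table can be reproduced with simple arithmetic, under the assumption that the listed influence is roughly the subgroup's share of the observations, k / n. That reading is an inference from the numbers, not a formula stated by datamirror.

```python
N = 7_500_000  # full-sample size used in the table above


def subgroup_share(k, n=N):
    """Fraction of the n observations that belong to a subgroup of size k."""
    return k / n


for k in (1, 10, 49, 100):
    print(f"k={k:>3}: share = {subgroup_share(k):.1e}")
```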
What's in the correlation file
- Continuous variables — all included
- Low-cardinality numeric categoricals (household counts, age groups, …) — included
- High-cardinality numeric categoricals — included
- String categoricals — automatically excluded (not numerically meaningful for a rank correlation)
Checkpoint coefficients
A normal regression coefficient is an aggregate summary statistic: it is computed over every observation in the estimation sample, and no individual's data is recoverable from it. But a checkpoint stores coefficients, standard errors, the command line, and N, and marginal suppression does not touch any of these. A regression run on very few people, or one whose coefficient is dominated by a tiny group, can functionally encode the underlying records — at the extreme, an OLS line through two points reproduces both points. So checkpoints are not assumed safe; each one passes a disclosure control at capture time, the only point where the estimation sample still exists to count cells.
The control runs across every supported estimator family (OLS, fixed effects, 2SLS, logit, probit, Poisson, negative binomial) and enforces three rules, governed by the same parameters as the marginals:
- Sample-size floor. A checkpoint with `e(N) < min_cell_size` is rejected outright: the whole regression is too small to describe. Nothing is stored.
- Residual-df floor. A checkpoint with fewer than `min_resid_df` residual degrees of freedom (`N` minus the number of estimated parameters) is rejected: a near-exact fit is close to an algebraic function of the rows and can encode the data. This is the standard safe-centre output-checking criterion. For ML estimators that don't expose `e(df_r)`, the floor is computed from `N` minus the coefficient count.
- Per-category coefficient drop. A factor level or interaction cell describing fewer than `min_cell_size` people yields a coefficient dominated by that small group, so that single coefficient is dropped from the exported recipe while the rest of the regression is kept. The variable still exists in the synthetic data (it is generated from its marginal), so the regression still reproduces; only the small-group coefficient is withheld. If dropping leaves no real predictor, the whole checkpoint is rejected.
Continuous predictors are never dropped — a slope estimated over many rows is not a small-cell statistic; only categorical levels and interaction cells are. The number of coefficients omitted is recorded in metadata.csv as n_coef_suppressed. The residual-df floor is the third privacy parameter, min_resid_df (default 10), resolved with the same precedence as min_cell_size and quantile_trim.
Fixed effects absorbed by reghdfe are not in e(b), so they are not gated by the per-category check directly; over-absorption is caught by the residual-df floor instead.
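The three capture-time rules can be sketched as one function. This is a simplified illustration, not datamirror's code: `check_checkpoint`, its input layout, and the coefficient names are hypothetical, residual df is always taken as N minus the coefficient count, and the intercept is assumed to be named `_cons` and not to count as a real predictor.

```python
def check_checkpoint(n_obs, coef_cells, min_cell_size=50, min_resid_df=10):
    """Apply the three capture-time rules to one regression.

    coef_cells maps each coefficient name to the number of observations in
    its factor level or interaction cell, or to None for a continuous
    predictor and for the intercept (neither is ever dropped).
    Returns (accepted, kept_coefficients, n_coef_suppressed).
    """
    # Rule 1: sample-size floor.
    if n_obs < min_cell_size:
        return False, [], 0
    # Rule 2: residual-df floor, here N minus the coefficient count.
    if n_obs - len(coef_cells) < min_resid_df:
        return False, [], 0
    # Rule 3: drop coefficients that describe a sub-threshold cell.
    kept = [name for name, cell_n in coef_cells.items()
            if cell_n is None or cell_n >= min_cell_size]
    n_suppressed = len(coef_cells) - len(kept)
    # Reject if only the intercept is left (assumed name: "_cons").
    if not any(name != "_cons" for name in kept):
        return False, [], n_suppressed
    return True, kept, n_suppressed


# Made-up regression: one continuous predictor, two factor levels, intercept.
coefs = {"age": None, "region_B": 400, "region_C": 12, "_cons": None}
result = check_checkpoint(1000, coefs)
print(result)
```

In this example the regression is accepted, `region_C` is withheld because its cell holds only 12 observations, and the suppression count is 1.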
Verifying compliance
Data stewards reviewing a checkpoint directory before approving a file-out transfer can verify:
- Which thresholds were used? Open `metadata.csv` and read `dm_min_cell_size`, `dm_quantile_trim`, and `dm_min_resid_df`. Those are the three policies applied to every pass of this extract.
- How many cells were suppressed? `metadata.csv` records `n_cat_categories` / `n_cat_suppressed` for the overall categorical marginals, and `n_cat_categories_strat` / `n_cat_suppressed_strat` for the stratified pass.
- Any strata dropped? `metadata.csv` records `n_strata_skipped_corr` (stratified correlations) and `n_strata_skipped_cont` (stratified continuous marginals): the counts skipped because they fell below the threshold.
- Any checkpoint coefficients dropped? `metadata.csv` records `n_coef_suppressed`: the number of small-group coefficients withheld by the capture-time checkpoint disclosure control.
One honest limitation: when exactly one cell of a categorical is suppressed, its count can be derived from the published residual (N in metadata.csv minus the sum of the published frequencies). This reveals a count, never an identity, and only within a stratum that is itself at least min_cell_size. Accurate reproduction is datamirror's purpose, so the final context-aware judgment on whether a residual count is sensitive belongs to the data center's own output check, which these controls are designed to conform to, not replace.
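The residual derivation is plain subtraction, shown here with made-up numbers (`residual_count` is a hypothetical helper):

```python
def residual_count(total_n, published_counts):
    """Count left over after subtracting every published frequency."""
    return total_n - sum(published_counts.values())


total_n = 1_000                   # N as recorded in metadata.csv (made-up value)
published = {"A": 600, "B": 365}  # one further category was suppressed
hidden = residual_count(total_n, published)
print(hidden)
```

When two or more cells are suppressed, the same subtraction yields only their combined count, not each cell's individual count.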
For a strict review workflow:

    * Set the threshold your policy requires once, persisted across sessions:
    registream config, dm_min_cell_size(20)
    * Or override for just this extract:
    datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
    datamirror extract, replace
    * Then inspect metadata.csv for dm_min_cell_size and the suppression counts
    * to confirm the extract matched the policy you intended.

See also
- datamirror overview
- Stata reference — privacy section summarizes this page
- Fidelity tiers — the four-layer architecture and what each layer does
- `datamirror/docs/PRIVACY.md` in the datamirror repo: annotated reference describing each suppression site