Privacy & Statistical Disclosure Control
datamirror implements Statistical Disclosure Control (SDC) so that checkpoint metadata cannot be used to identify small groups or individuals. All privacy controls are governed by one global parameter with sane defaults aligned to the strictest national agency thresholds.
TL;DR
- One parameter: `DM_MIN_CELL_SIZE`.
- Default: 50, stricter than any major national statistical agency minimum.
- Groups / strata / categories with fewer than `DM_MIN_CELL_SIZE` observations are automatically suppressed from checkpoint output.
- Continuous quantiles and aggregate correlation matrices are non-disclosive and don't need per-cell suppression.
- Coefficient checkpoints are aggregate summary statistics; the same argument applies.
This page is the reference. For the full source-code-annotated version with line-number pointers, see datamirror/docs/PRIVACY.md in the repo.
The one parameter: dm_min_cell_size
The threshold is resolved at the start of every datamirror init call. Three ways to set it; the nearest override wins (session option > persistent config > package default):
- Per session via the init option: `datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)`
- Persistent across sessions via the registream config: `registream config, dm_min_cell_size(20)`. Written to `~/.registream/config_stata.csv`. View the current value with `registream info`.
- Package default (fallback): `$DM_MIN_CELL_SIZE = 50` at the top of `_rs_datamirror_utils.ado`. Used when neither of the above is set.
Whichever value wins the resolution is recorded in metadata.csv at extract time, so any reviewer inspecting the checkpoint bundle can verify the policy applied without re-running.
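The precedence order can be sketched as follows. This is a minimal Python illustration of the resolution logic, not the actual ado-file implementation; the function name and arguments are hypothetical.

```python
# Illustrative sketch of threshold resolution:
# session option > persistent config > package default.
PACKAGE_DEFAULT = 50  # $DM_MIN_CELL_SIZE in _rs_datamirror_utils.ado


def resolve_min_cell_size(session_option=None, persistent_config=None):
    """Return the threshold that wins resolution; the nearest override wins."""
    if session_option is not None:      # datamirror init, ... min_cell_size(20)
        return session_option
    if persistent_config is not None:   # registream config, dm_min_cell_size(20)
        return persistent_config
    return PACKAGE_DEFAULT              # fallback

# The winning value is what gets recorded in metadata.csv at extract time.
print(resolve_min_cell_size(20, 10))    # session option wins → 20
print(resolve_min_cell_size(None, 10))  # persistent config wins → 10
print(resolve_min_cell_size())          # package default → 50
```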
Thresholds by agency
| Threshold | Safety level | Use case | Agencies |
|---|---|---|---|
| 50 | Maximum | Public data release, multi-agency compliance | All (strictest) |
| 20 | Standard | Synthetic microdata for research | Statistics Sweden (SCB), Eurostat microdata |
| 10 | Minimum | Internal use within a secure environment | US Census Bureau, UK ONS, Eurostat tables |
| 5 | Insufficient | Not recommended | Below most microdata standards |
Alignment per agency (all verified against public guidance):
| Agency | Tables | Microdata | Notes |
|---|---|---|---|
| US Census Bureau | 10 | 10 | Standard for public-use microdata |
| UK ONS | 10 | 10 | Standard for all public releases |
| Statistics Sweden (SCB) | 5 | 20 | Stricter for microdata |
| Eurostat | 5 | 20 | Stricter for microdata |
| Statistics Canada | 5 | 10 | Context-dependent |
The default of 50 is deliberately above all of these, so checkpoint output is acceptable to any agency at its strictest policy. Teams working under agencies comfortable with lower thresholds can relax it via the options above.
What the threshold protects
Three areas of the checkpoint output apply the DM_MIN_CELL_SIZE test at write-time:
1. Overall categorical marginals
File: checkpoints/<dataset>/marginals_cat.csv
Categories with fewer than DM_MIN_CELL_SIZE observations are suppressed (not written). Example with default threshold 50:
- Category `rare_disease=Yes` with n=35 → suppressed
- Category `rare_disease=No` with n=7,465,964 → included
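The suppression rule amounts to a simple filter on cell counts. A minimal Python sketch of the logic (the real implementation is in Stata ado code; the function name here is hypothetical):

```python
MIN_CELL_SIZE = 50  # default threshold


def suppress_cells(counts, min_cell_size=MIN_CELL_SIZE):
    """Keep only categories whose count meets the threshold;
    small cells are dropped and never written to marginals_cat.csv."""
    return {cat: n for cat, n in counts.items() if n >= min_cell_size}


counts = {"rare_disease=Yes": 35, "rare_disease=No": 7_465_964}
print(suppress_cells(counts))  # → {'rare_disease=No': 7465964}
```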
2. Stratified categorical marginals
File: checkpoints/<dataset>/marginals_cat_stratified.csv
Within each stratum, categories with fewer than DM_MIN_CELL_SIZE observations are suppressed. Example with stratification by region:
- `region=A, occupation=rare_job`, n=15 → suppressed
- `region=A, occupation=teacher`, n=45,230 → included
3. Stratified correlations
File: checkpoints/<dataset>/correlations_stratified.csv
Entire strata with fewer than DM_MIN_CELL_SIZE observations are skipped. Example:
- `rural_island=1` with n=12 → stratum skipped entirely
- `urban_city=1` with n=2,456,123 → correlations computed
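For stratified correlations the unit of suppression is the whole stratum, not the cell. A sketch of that partition, with the skip count that would feed `n_strata_skipped_corr` in metadata.csv (function and variable names are illustrative, not from the source):

```python
MIN_CELL_SIZE = 50


def partition_strata(strata_sizes, min_cell_size=MIN_CELL_SIZE):
    """Split strata into those large enough for correlations
    and those skipped entirely (no per-cell suppression here)."""
    keep = [s for s, n in strata_sizes.items() if n >= min_cell_size]
    skipped = [s for s, n in strata_sizes.items() if n < min_cell_size]
    return keep, skipped


sizes = {"rural_island=1": 12, "urban_city=1": 2_456_123}
keep, skipped = partition_strata(sizes)
print(keep, skipped)      # → ['urban_city=1'] ['rural_island=1']
print(len(skipped))       # the count recorded as n_strata_skipped_corr
```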
Continuous variables — why they don't need suppression
Continuous-variable marginals (the 101-point quantile distribution in marginals_cont.csv) do not require per-cell suppression because:
- Quantiles are non-disclosive. The 101-point cumulative distribution doesn't reveal any individual's value — each quantile is a population-level statistic.
- No cell counts. Continuous variables don't have categories with small counts to worry about.
- Correlations use full sample. The full correlation matrix uses all observations, not per-cell counts.
This is consistent with how national statistical agencies treat continuous-variable summary statistics in public data products.
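For concreteness, a 101-point quantile grid is just the percentiles 0 through 100 of a variable. A minimal Python sketch of one way to compute it (the interpolation method datamirror actually uses is not specified here; this sketch assumes linear interpolation between order statistics):

```python
def quantile_grid(values, points=101):
    """Evenly spaced cumulative-distribution summary: percentiles
    0..100 by linear interpolation between sorted observations.
    Each output is a population-level statistic, not a record."""
    xs = sorted(values)
    n = len(xs)
    grid = []
    for k in range(points):
        pos = (n - 1) * k / (points - 1)   # fractional rank for percentile k
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        grid.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return grid


grid = quantile_grid(range(1, 1001))
# grid[0] is the minimum, grid[50] the median, grid[100] the maximum
print(grid[0], grid[50], grid[100])  # → 1 500.5 1000
```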
Correlation matrices — disclosure-risk assessment
Correlation matrices are low-risk and standard practice for statistical agencies. datamirror exports a full pairwise-Pearson correlation matrix (correlations.csv) without per-cell suppression.
Aggregate statistics, not individual data
A Pearson correlation is a population-level summary:
`Correlation(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n × σ_X × σ_Y)`

- Each correlation averages over all n observations.
- No individual's data can be recovered from a correlation coefficient.
- Standard output from regression analyses worldwide.
Small groups have negligible influence
With a large full sample (e.g. n = 7.5M), a small group's influence on the overall correlation scales with its share of the sample:
| Small group size | Influence on overall correlation |
|---|---|
| n=1 | ~0.0000001 |
| n=10 | ~0.000001 |
| n=49 | ~0.000007 |
| n=100 | ~0.00001 |
Even a perfectly correlated small subgroup (r=1.0 internally) contributes orders of magnitude below the noise floor of a full-sample correlation. Correlation matrices are reliably non-disclosive at realistic register-data sample sizes.
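The table's entries follow from simple arithmetic: a subgroup's maximum influence is roughly its share of the sample, n_small / n_total. The bound used here is an illustrative approximation matching the table's order-of-magnitude figures, not a formal disclosure bound:

```python
N_TOTAL = 7_500_000  # register-scale full sample


def max_influence(n_small, n_total=N_TOTAL):
    """Rough upper bound on how much a subgroup can shift the
    overall correlation: proportional to its sample share."""
    return n_small / n_total


for n in (1, 10, 49, 100):
    print(f"n={n}: ~{max_influence(n):.7f}")
# Even n=100 contributes on the order of 1e-5, far below the
# sampling noise of a full-sample correlation.
```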
What's in the correlation file
- Continuous variables — all included
- Low-cardinality numeric categoricals (household counts, age groups, …) — included
- High-cardinality numeric categoricals — included
- String categoricals — automatically excluded (not numerically meaningful for Pearson)
Checkpoint coefficients
Regression coefficients in checkpoints_coef.csv are aggregate summary statistics with the same low-risk profile as correlation matrices. A coefficient is computed over all n observations in the regression sample; no individual's data is recoverable from it. This matches how coefficients are reported in published papers without per-cell suppression.
Two caveats:
- If the regression sample itself is too small (e.g. n < `DM_MIN_CELL_SIZE`), the checkpoint should not be extracted: the underlying sample is already below the disclosure threshold. datamirror warns on extract when this happens.
- For fixed-effects models with many absorbed dummies, coefficients on individual fixed effects are not stored (reghdfe absorbs them); only the main-predictor coefficients are stored as checkpoint targets.
Verifying compliance
Data stewards reviewing a checkpoint directory before approving a file-out transfer can verify:
- Which threshold was used? Open `metadata.csv` and read the `dm_min_cell_size` row. That's the threshold applied to every pass of this extract.
- How many cells were suppressed? `metadata.csv` records `n_cat_categories` / `n_cat_suppressed` for the overall categorical marginals, and `n_cat_categories_strat` / `n_cat_suppressed_strat` for the stratified pass.
- Any strata dropped from stratified correlations? `metadata.csv` records `n_strata_skipped_corr`, the count of strata skipped because they fell below the threshold.
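A reviewer can script these checks instead of eyeballing the file. The sketch below assumes a two-column key/value layout for metadata.csv, consistent with the rows named above; the actual file layout may differ:

```python
import csv
import io

# Hypothetical metadata.csv content; the field names come from this
# page, the layout is an assumption.
metadata_csv = """key,value
dm_min_cell_size,50
n_cat_categories,120
n_cat_suppressed,7
n_strata_skipped_corr,2
"""


def read_metadata(text):
    """Parse a key/value metadata file into a dict of strings."""
    return {row["key"]: row["value"] for row in csv.DictReader(io.StringIO(text))}


meta = read_metadata(metadata_csv)
# Fail the review if the extract used a weaker threshold than policy.
assert int(meta["dm_min_cell_size"]) >= 20, "extract violates review policy"
print(meta["n_cat_suppressed"], "categories suppressed")  # → 7 categories suppressed
```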
For a strict review workflow:
- Set a stricter threshold once, persisted across sessions: `registream config, dm_min_cell_size(20)`
- Or override for just this extract: `datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)` followed by `datamirror extract, replace`
- Then inspect metadata.csv for dm_min_cell_size and the suppression counts to confirm the extract matched the policy you intended.

See also
- datamirror overview
- Stata reference — privacy section summarizes this page
- Fidelity tiers — the four-layer architecture and what each layer does
- `datamirror/docs/PRIVACY.md` in the datamirror repo: source-code-annotated reference with line-number pointers to each suppression site