
Privacy & Statistical Disclosure Control

datamirror implements Statistical Disclosure Control (SDC) so that checkpoint metadata cannot be used to identify small groups or individuals. All privacy controls are governed by one global parameter with sane defaults aligned to the strictest national agency thresholds.

TL;DR

  • One parameter: DM_MIN_CELL_SIZE.
  • Default: 50 — stricter than any major national statistical agency minimum.
  • Groups / strata / categories with fewer than DM_MIN_CELL_SIZE observations are automatically suppressed from checkpoint output.
  • Continuous quantiles and aggregate correlation matrices are non-disclosive and don't need per-cell suppression.
  • Coefficient checkpoints are aggregate summary statistics; the same argument applies.

This page is the reference. For the full source-code-annotated version with line-number pointers, see datamirror/docs/PRIVACY.md in the repo.

The one parameter: dm_min_cell_size

The threshold is resolved at the start of every datamirror init call. Three ways to set it; the nearest override wins (session option > persistent config > package default):

  1. Per session via the init option:
    datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
  2. Persistent across sessions via the registream config:
    registream config, dm_min_cell_size(20)
    Written to ~/.registream/config_stata.csv. View current value with registream info.
  3. Package default (fallback): $DM_MIN_CELL_SIZE = 50 at the top of _rs_datamirror_utils.ado. Used when neither of the above is set.

Whichever value wins the resolution is recorded in metadata.csv at extract time, so any reviewer inspecting the checkpoint bundle can verify the policy applied without re-running.
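The resolution order above can be sketched as a small function (the function name and argument names are illustrative, not datamirror's internal API):

```python
# Sketch of the resolution order described above: the nearest override wins.
def resolve_min_cell_size(session_option=None, config_value=None, default=50):
    """init option > registream config > package default."""
    for value in (session_option, config_value):
        if value is not None:
            return value
    return default

resolve_min_cell_size()                      # 50: package default
resolve_min_cell_size(config_value=20)       # 20: persistent config
resolve_min_cell_size(20, config_value=10)   # 20: session option wins
```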

Thresholds by agency

| Threshold | Safety level | Use case | Agencies |
|-----------|--------------|----------|----------|
| 50 | Maximum | Public data release, multi-agency compliance | All (strictest) |
| 20 | Standard | Synthetic microdata for research | Statistics Sweden (SCB), Eurostat microdata |
| 10 | Minimum | Internal use within a secure environment | US Census Bureau, UK ONS, Eurostat tables |
| 5 | Insufficient | Not recommended | Below most microdata standards |

Alignment per agency (all verified against public guidance):

| Agency | Tables | Microdata | Notes |
|--------|--------|-----------|-------|
| US Census Bureau | 10 | 10 | Standard for public-use microdata |
| UK ONS | 10 | 10 | Standard for all public releases |
| Statistics Sweden (SCB) | 5 | 20 | Stricter for microdata |
| Eurostat | 5 | 20 | Stricter for microdata |
| Statistics Canada | 5 | 10 | Context-dependent |

The default of 50 is deliberately above all of these: it ensures that checkpoint output is acceptable to any agency at its strictest policy. Agencies comfortable with lower thresholds can relax it via min_cell_size() or registream config.

What the threshold protects

Three areas of the checkpoint output apply the DM_MIN_CELL_SIZE test at write time:

1. Overall categorical marginals

File: checkpoints/<dataset>/marginals_cat.csv

Categories with fewer than DM_MIN_CELL_SIZE observations are suppressed (not written). Example with default threshold 50:

  • Category rare_disease=Yes with n=35 → suppressed
  • Category rare_disease=No with n=7,465,964 → included
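The write-time rule amounts to a simple count filter. A minimal sketch (illustrative, not datamirror's actual code):

```python
# Cells below the threshold are suppressed: never written to marginals_cat.csv.
MIN_CELL_SIZE = 50

counts = {"rare_disease=Yes": 35, "rare_disease=No": 7_465_964}
published = {cat: n for cat, n in counts.items() if n >= MIN_CELL_SIZE}
# published == {"rare_disease=No": 7465964}; the n=35 cell is suppressed
```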

2. Stratified categorical marginals

File: checkpoints/<dataset>/marginals_cat_stratified.csv

Within each stratum, categories with fewer than DM_MIN_CELL_SIZE observations are suppressed. Example with stratification by region:

  • region=A, occupation=rare_job, n=15 → suppressed
  • region=A, occupation=teacher, n=45,230 → included

3. Stratified correlations

File: checkpoints/<dataset>/correlations_stratified.csv

Entire strata with fewer than DM_MIN_CELL_SIZE observations are skipped. Example:

  • rural_island=1 with n=12 → stratum skipped entirely
  • urban_city=1 with n=2,456,123 → correlations computed
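Unlike the categorical marginals, the gate here operates on whole strata: a stratum below the threshold is skipped before any statistic is formed. A minimal sketch (illustrative, not datamirror's implementation):

```python
# Correlations are only computed for strata that meet the threshold.
MIN_CELL_SIZE = 50

stratum_sizes = {"rural_island=1": 12, "urban_city=1": 2_456_123}
computed = [s for s, n in stratum_sizes.items() if n >= MIN_CELL_SIZE]
# computed == ["urban_city=1"]; rural_island=1 (n=12) is skipped entirely
```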

Continuous variables — why they don't need suppression

Continuous-variable marginals (the 101-point quantile distribution in marginals_cont.csv) do not trigger disclosure risk because:

  • Quantiles are non-disclosive. The 101-point cumulative distribution doesn't reveal any individual's value — each quantile is a population-level statistic.
  • No cell counts. Continuous variables don't have categories with small counts to worry about.
  • Correlations use full sample. The full correlation matrix uses all observations, not per-cell counts.

This is consistent with how national statistical agencies treat continuous-variable summary statistics in public data products.
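The shape of such a quantile grid can be sketched as follows (synthetic data; nearest-rank quantiles used for illustration, which may differ from the interpolation datamirror uses):

```python
import random

# Every point of a 101-point quantile grid like the one marginals_cont.csv
# stores is a statistic of the full sample; there is no small cell to leak.
rng = random.Random(0)
data = sorted(rng.lognormvariate(10, 1) for _ in range(100_000))

n = len(data)
grid = [data[round(p * (n - 1) / 100)] for p in range(101)]  # p0 .. p100
```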

Correlation matrices — disclosure-risk assessment

Correlation matrices are low-risk, and releasing them is standard practice for statistical agencies. datamirror exports a full pairwise-Pearson correlation matrix (correlations.csv) without per-cell suppression.

Aggregate statistics, not individual data

A Pearson correlation is a population-level summary:

Correlation(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (n × σ_x × σ_y)

  • Each correlation averages over all n observations.
  • No individual's data can be recovered from a correlation coefficient.
  • Standard output from regression analyses worldwide.

Small groups have negligible influence

With large n (e.g. n=7.5M), a small group's influence on the full-sample correlation is bounded by roughly its share of the observations, m/n:

| Small group size | Influence on overall correlation |
|------------------|----------------------------------|
| n=1 | ~0.0000001 |
| n=10 | ~0.000001 |
| n=49 | ~0.000007 |
| n=100 | ~0.00001 |

Even a perfectly correlated small subgroup (r=1.0 internally) contributes orders of magnitude below the noise floor of a full-sample correlation. Correlation matrices are reliably non-disclosive at realistic register-data sample sizes.
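This claim is easy to check numerically. The sketch below appends a perfectly correlated subgroup of m=49 observations to a large uncorrelated sample and measures the shift in the full-sample Pearson correlation (scaled down: n=200,000 stands in for a 7.5M-row register; the pearson helper mirrors the formula above):

```python
import random

def pearson(a, b):
    """Pearson r, written out term-by-term as in the formula above."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
n = 200_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]      # bulk: ~uncorrelated
sub = [rng.gauss(0, 1) for _ in range(49)]   # subgroup: r = 1.0 internally

shift = abs(pearson(x + sub, y + sub) - pearson(x, y))
# shift is on the order of m/n = 49/200000 ≈ 0.00025
```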

What's in the correlation file

  • Continuous variables — all included
  • Low-cardinality numeric categoricals (household counts, age groups, …) — included
  • High-cardinality numeric categoricals — included
  • String categoricals — automatically excluded (not numerically meaningful for Pearson)

Checkpoint coefficients

Regression coefficients in checkpoints_coef.csv are aggregate summary statistics with the same low-risk profile as correlation matrices. A coefficient is computed over all n observations in the regression sample; no individual's data is recoverable from it. This matches how coefficients are reported in published papers without per-cell suppression.

Two caveats:

  • If the regression sample itself is too small (e.g. n < DM_MIN_CELL_SIZE), the checkpoint should not be extracted — the underlying sample is already below the disclosure threshold. datamirror warns on extract when this happens.
  • For fixed-effects models with many absorbed dummies, coefficients on individual fixed effects are not stored (reghdfe absorbs them); only the main-predictor coefficients are stored as checkpoint targets.
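The first caveat reduces to a sample-size guard at extract time. A sketch (function name is illustrative, not datamirror's API):

```python
# A coefficient checkpoint is only safe to extract when the regression
# sample itself meets the disclosure threshold.
MIN_CELL_SIZE = 50

def coefficient_checkpoint_ok(regression_n):
    return regression_n >= MIN_CELL_SIZE

coefficient_checkpoint_ok(7_500_000)   # True: full-register regression
coefficient_checkpoint_ok(35)          # False: warn, do not extract
```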

Verifying compliance

Data stewards reviewing a checkpoint directory before approving a file-out transfer can verify:

  1. Which threshold was used? Open metadata.csv and read the dm_min_cell_size row. That's the threshold applied to every pass of this extract.
  2. How many cells were suppressed? metadata.csv records n_cat_categories / n_cat_suppressed for the overall categorical marginals, and n_cat_categories_strat / n_cat_suppressed_strat for the stratified pass.
  3. Any strata dropped from stratified correlations? metadata.csv records n_strata_skipped_corr: the count of strata skipped because they fell below the threshold.
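A reviewer can script these checks against metadata.csv. The sketch below assumes a two-column key/value layout and invents example numbers for illustration; the field names are the ones documented above:

```python
import csv
import io

# Hypothetical metadata.csv excerpt (layout and values are assumptions).
metadata_csv = """key,value
dm_min_cell_size,50
n_cat_categories,120
n_cat_suppressed,7
n_strata_skipped_corr,2
"""

rows = list(csv.reader(io.StringIO(metadata_csv)))
meta = dict(rows[1:])                            # skip the header row
policy_ok = int(meta["dm_min_cell_size"]) >= 20  # matches intended policy?
```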

For a strict review workflow:

  • Set a stricter threshold once, persisted across sessions:
    registream config, dm_min_cell_size(20)
  • Or override for just this extract:
    datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
    datamirror extract, replace
  • Then inspect metadata.csv for dm_min_cell_size and the suppression counts to confirm the extract matched the policy you intended.

See also

  • datamirror overview
  • Stata reference — privacy section summarizes this page
  • Fidelity tiers — the four-layer architecture and what each layer does
  • datamirror/docs/PRIVACY.md in the datamirror repo — source-code-annotated reference with line-number pointers to each suppression site