datamirror privacy & statistical disclosure control (SDC)

TL;DR

One parameter: DM_MIN_CELL_SIZE.
Default: 50 — stricter than any major national statistical agency minimum.
Groups / strata / categories with fewer than DM_MIN_CELL_SIZE observations are automatically suppressed from checkpoint output.
Continuous quantiles and aggregate correlation matrices are non-disclosive and don't need per-cell suppression.
Coefficient checkpoints are aggregate summary statistics; the same argument applies.

This page is the reference. For the full source-code-annotated version with line-number pointers, see datamirror/docs/PRIVACY.md in the repo.

The one parameter: `dm_min_cell_size`

The threshold is resolved at the start of every datamirror init call. Three ways to set it; the nearest override wins (session option > persistent config > package default):

Per session via the init option:
```
datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
```

Persistent across sessions via the registream config:
```
registream config, dm_min_cell_size(20)
```
Written to ~/.registream/config_stata.csv. View current value with registream info.
Package default (fallback): $DM_MIN_CELL_SIZE = 50 at the top of _rs_datamirror_utils.ado. Used when neither of the above is set.

Whichever value wins the resolution is recorded in metadata.csv at extract time, so any reviewer inspecting the checkpoint bundle can verify the policy applied without re-running.
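
The precedence rule (session option > persistent config > package default) can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not datamirror's Stata implementation; the function and argument names are hypothetical.

```python
from typing import Optional

PACKAGE_DEFAULT = 50  # mirrors the documented package default


def resolve_min_cell_size(session_option: Optional[int] = None,
                          persistent_config: Optional[int] = None,
                          package_default: int = PACKAGE_DEFAULT) -> int:
    """Return the threshold from the nearest override that is set.

    Hypothetical helper: checks the session option first, then the
    persistent config, and falls back to the package default.
    """
    for candidate in (session_option, persistent_config):
        if candidate is not None:
            return candidate
    return package_default
```

With nothing set the sketch returns 50; with only the persistent config set to 20 it returns 20; a session option always takes precedence over both.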

Thresholds by agency

Threshold	Safety level	Use case	Agencies
50	Maximum	Public data release, multi-agency compliance	All (strictest)
20	Standard	Synthetic microdata for research	Statistics Sweden (SCB), Eurostat microdata
10	Minimum	Internal use within a secure environment	US Census Bureau, UK ONS, Eurostat tables
5	Insufficient	Not recommended	Below most microdata standards

Alignment per agency (all verified against public guidance):

Agency	Tables	Microdata	Notes
US Census Bureau	10	10	Standard for public-use microdata
UK ONS	10	10	Standard for all public releases
Statistics Sweden (SCB)	5	20	Stricter for microdata
Eurostat	5	20	Stricter for microdata
Statistics Canada	5	10	Context-dependent

The default of 50 is deliberately above all of these, so checkpoint output is acceptable to any listed agency at its strictest policy. Agencies comfortable with a lower threshold can relax it.
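
As a rough way to read the alignment table, the following Python sketch lists which agencies' minimums a given threshold meets. The numbers are copied from the table above; the helper name is hypothetical, and the rule "meets both the tables and microdata minimum" is an assumption about how one might apply the table, not an official compliance test.

```python
from typing import Dict, List, Tuple

# (tables minimum, microdata minimum), copied from the alignment table
AGENCY_MINIMUMS: Dict[str, Tuple[int, int]] = {
    "US Census Bureau": (10, 10),
    "UK ONS": (10, 10),
    "Statistics Sweden (SCB)": (5, 20),
    "Eurostat": (5, 20),
    "Statistics Canada": (5, 10),
}


def agencies_satisfied(threshold: int) -> List[str]:
    """Agencies whose tables AND microdata minimums are both met."""
    return [name for name, (tables, micro) in AGENCY_MINIMUMS.items()
            if threshold >= max(tables, micro)]
```

Under this reading, a threshold of 50 covers every agency in the table, 10 covers the US Census Bureau, UK ONS, and Statistics Canada, and 5 covers none of them for microdata.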

What the threshold protects

Three areas of the checkpoint output apply the DM_MIN_CELL_SIZE test at write-time:

1. Overall categorical marginals

File: checkpoints/<dataset>/marginals_cat.csv

Categories with fewer than DM_MIN_CELL_SIZE observations are suppressed (not written). Example with default threshold 50:

Category rare_disease=Yes with n=35 → suppressed
Category rare_disease=No with n=7,465,964 → included
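
The suppression rule for this file can be sketched in Python as follows. It illustrates the "fewer than the threshold means not written" rule using the example counts above; it is not the package's Stata code, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def suppress_small_categories(counts: Dict[str, int],
                              min_cell_size: int = 50
                              ) -> Tuple[Dict[str, int], int]:
    """Keep categories with at least min_cell_size observations.

    Returns the kept categories and the number suppressed.
    """
    kept = {cat: n for cat, n in counts.items() if n >= min_cell_size}
    return kept, len(counts) - len(kept)


# Example counts from the text: rare_disease = Yes / No
kept, n_suppressed = suppress_small_categories({"Yes": 35, "No": 7_465_964})
```

Because the rule is "fewer than", a category with exactly 50 observations is kept at the default threshold.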

2. Stratified categorical marginals

File: checkpoints/<dataset>/marginals_cat_stratified.csv

Within each stratum, categories with fewer than DM_MIN_CELL_SIZE observations are suppressed. Example with stratification by region:

region=A, occupation=rare_job, n=15 → suppressed
region=A, occupation=teacher, n=45,230 → included
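
In the stratified pass the same test applies to each (stratum, category) cell rather than to the overall category count. A minimal Python sketch, with a hypothetical function name and one extra made-up cell (region B) added purely for illustration:

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (stratum, category)


def suppress_within_strata(counts: Dict[Cell, int],
                           min_cell_size: int = 50) -> Dict[Cell, int]:
    """Drop every (stratum, category) cell below the threshold."""
    return {cell: n for cell, n in counts.items() if n >= min_cell_size}


cells = {
    ("A", "rare_job"): 15,     # from the example above
    ("A", "teacher"): 45_230,  # from the example above
    ("B", "teacher"): 30,      # hypothetical: common job, small stratum
}
released = suppress_within_strata(cells)
```

The hypothetical region B row shows the practical consequence: a category that is large overall can still be suppressed inside a small stratum.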

3. Stratified correlations

File: checkpoints/<dataset>/correlations_stratified.csv

Entire strata with fewer than DM_MIN_CELL_SIZE observations are skipped. Example:

rural_island=1 with n=12 → stratum skipped entirely
urban_city=1 with n=2,456,123 → correlations computed
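
Skipping whole strata differs from cell suppression: the stratum's correlations are never computed at all. The Python sketch below illustrates that control flow on tiny synthetic data; the function names and the data are hypothetical, and the package itself does this in Stata.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def stratified_correlations(strata: Dict[str, Tuple[List[float], List[float]]],
                            min_cell_size: int = 50
                            ) -> Tuple[Dict[str, float], int]:
    """Correlate x and y per stratum, skipping strata below the threshold."""
    results: Dict[str, float] = {}
    n_skipped = 0
    for name, (xs, ys) in strata.items():
        if len(xs) < min_cell_size:
            n_skipped += 1  # stratum skipped entirely, nothing written
            continue
        results[name] = pearson(xs, ys)
    return results, n_skipped


# Synthetic data: one stratum of 12 observations, one of 60
small = [float(i) for i in range(12)]
large = [float(i) for i in range(60)]
strata = {
    "rural_island": (small, [2 * x + 1 for x in small]),
    "urban_city": (large, [2 * x + 1 for x in large]),
}
results, n_skipped = stratified_correlations(strata)
```

Here only `urban_city` appears in `results`, and `n_skipped` plays the role of the skipped-strata count a reviewer would look for in the metadata.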

Continuous variables — why they don't need suppression

Continuous-variable marginals (the 101-point quantile distribution in marginals_cont.csv) do not trigger disclosure risk because:

Quantiles are non-disclosive. The 101-point cumulative distribution doesn't reveal any individual's value — each quantile is a population-level statistic.
No cell counts. Continuous variables don't have categories with small counts to worry about.
Correlations use full sample. The full correlation matrix uses all observations, not per-cell counts.

This is consistent with how national statistical agencies treat continuous-variable summary statistics in public data products.

Correlation matrices — disclosure-risk assessment

Releasing correlation matrices is low-risk and standard practice among statistical agencies. datamirror exports a full pairwise-Pearson correlation matrix (correlations.csv) without per-cell suppression.

Aggregate statistics, not individual data

A Pearson correlation is a population-level summary:

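Correlation(X, Y) = Σ_i (x_i - x̄)(y_i - ȳ) / (n × σ_x × σ_y), where the sum runs over all n observations and σ_x, σ_y are the population standard deviations of X and Y.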

Each correlation averages over all n observations.
No individual's data can be recovered from a correlation coefficient.
Standard output from regression analyses worldwide.

Small groups have negligible influence

With a large full sample (e.g. n=7.5M observations), a small group carries almost no weight in the overall correlation:

Small group size	Influence on overall correlation
n=1	~0.0000001
n=10	~0.000001
n=49	~0.000007
n=100	~0.00001

Even a perfectly correlated small subgroup (r=1.0 internally) contributes orders of magnitude below the noise floor of a full-sample correlation. Correlation matrices are reliably non-disclosive at realistic register-data sample sizes.
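
The figures in the table are consistent with approximating a subgroup's influence as its share of the full sample, k / N. That reading is an assumption made for this sketch (the source does not state how the figures were derived), and the names below are hypothetical.

```python
N_TOTAL = 7_500_000  # example full-sample size from the text


def subgroup_share(k: int, n_total: int = N_TOTAL) -> float:
    """Fraction of all observations contributed by a subgroup of size k."""
    return k / n_total


# Print the share for each subgroup size listed in the table
for k in (1, 10, 49, 100):
    print(f"n={k:>3}: {subgroup_share(k):.1e}")
```

Each printed share matches the corresponding table entry to within rounding, which is why a group just under the default threshold (n=49) moves a full-sample correlation by only a few parts per million at this sample size.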

What's in the correlation file

Continuous variables — all included
Low-cardinality numeric categoricals (household counts, age groups, …) — included
High-cardinality numeric categoricals — included
String categoricals — automatically excluded (not numerically meaningful for Pearson)

Checkpoint coefficients

Regression coefficients in checkpoints_coef.csv are aggregate summary statistics with the same low-risk profile as correlation matrices. A coefficient is computed over all n observations in the regression sample; no individual's data is recoverable from it. This matches how coefficients are reported in published papers without per-cell suppression.

Two caveats:

If the regression sample itself is too small (e.g. n < DM_MIN_CELL_SIZE), the checkpoint should not be extracted — the underlying sample is already below the disclosure threshold. datamirror warns on extract when this happens.
For fixed-effects models with many absorbed dummies, coefficients on individual fixed effects are not stored (reghdfe absorbs them); only the main-predictor coefficients are stored as checkpoint targets.

Verifying compliance

Data stewards reviewing a checkpoint directory before approving a file-out transfer can verify:

Which threshold was used? Open metadata.csv and read the dm_min_cell_size row. That's the threshold applied to every pass of this extract.
How many cells were suppressed? metadata.csv records n_cat_categories / n_cat_suppressed for the overall categorical marginals, and n_cat_categories_strat / n_cat_suppressed_strat for the stratified pass.
Any strata dropped from stratified correlations? metadata.csv records n_strata_skipped_corr: the count of strata skipped because they fell below the threshold.
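
A reviewer could script these checks. The Python sketch below assumes metadata.csv is a two-column key/value file with header names `key` and `value`; that layout is an assumption (only the row names are documented here), and the sample values are invented for illustration.

```python
import csv
import io
from typing import Dict, TextIO


def read_metadata(handle: TextIO) -> Dict[str, str]:
    """Read key/value rows; the column names are an assumed layout."""
    return {row["key"]: row["value"] for row in csv.DictReader(handle)}


def meets_policy(meta: Dict[str, str], required_threshold: int) -> bool:
    """True if the recorded threshold is at least the required one."""
    return int(meta["dm_min_cell_size"]) >= required_threshold


# Invented sample content standing in for a real metadata.csv
SAMPLE = """key,value
dm_min_cell_size,50
n_cat_categories,120
n_cat_suppressed,7
n_cat_categories_strat,640
n_cat_suppressed_strat,58
n_strata_skipped_corr,2
"""

meta = read_metadata(io.StringIO(SAMPLE))
ok = meets_policy(meta, required_threshold=20)
```

For a real bundle, replace the in-memory sample with `open("metadata.csv", newline="")` and adjust the column names to match the actual file.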

For a strict review workflow:

* Set the review threshold once, persisted across sessions:
registream config, dm_min_cell_size(20)

* Or override for just this extract:
datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
datamirror extract, replace

* Then inspect metadata.csv for dm_min_cell_size and the suppression counts
* to confirm the extract matched the policy you intended.
