

Privacy & Statistical Disclosure Control

datamirror implements Statistical Disclosure Control (SDC) so that checkpoint metadata cannot be used to identify small groups or individuals. Three controls govern the output: cell-size suppression, tail-coding of continuous variables, and a capture-time disclosure check on each regression checkpoint. The default cell size is set above the strictest national agency threshold.

TL;DR

  • Three controls, each resolved the same way (init option > registream config > package default):
    • min_cell_size (default 50) — suppress small categorical cells and small strata.
    • quantile_trim (default 1) — top/bottom-code continuous extremes.
    • min_resid_df (default 10) — minimum residual degrees of freedom for a regression checkpoint.
  • The default cell size of 50 is stricter than any major national statistical agency minimum.
  • Groups / strata / categories with fewer than min_cell_size observations are automatically suppressed from checkpoint output.
  • Each regression checkpoint passes a capture-time disclosure check: reject if sample N is below min_cell_size, reject if residual df is below min_resid_df, and drop any coefficient describing a sub-min_cell_size group while keeping the rest.
  • Continuous quantiles (after tail-coding) and aggregate Spearman correlation matrices are non-disclosive and don't need per-cell suppression.

This page is the reference. For the full annotated version, see datamirror/docs/PRIVACY.md in the repo.

The privacy parameters

datamirror has three privacy controls. All three resolve the same way at the start of every datamirror init call; the nearest override wins (session option > persistent config > package default):

Parameter | Default | Governs
min_cell_size | 50 | Suppression of small categorical cells and small strata.
quantile_trim | 1 | Top/bottom-coding of continuous extremes (see continuous variables).
min_resid_df | 10 | Minimum residual degrees of freedom for a regression checkpoint (see checkpoint coefficients).

The cell-size control is shown below; the other two are documented in their own sections. The resolution pattern is identical for all three:

  1. Per session via the init option:
    datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
  2. Persistent across sessions via the registream config:
    registream config, dm_min_cell_size(20)
    Written to ~/.registream/config_stata.csv. View current value with registream info.
  3. Package default (fallback): $DM_MIN_CELL_SIZE = 50 at the top of _dm_utils.ado. Used when neither of the above is set. The quantile_trim and min_resid_df defaults live in the same file.

Whichever values win the resolution are recorded in metadata.csv at extract time, so any reviewer inspecting the checkpoint bundle can verify the policy applied without re-running.
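
The precedence described above can be sketched as follows. This is a minimal Python illustration of the stated rule (session option, then persistent config, then package default), not datamirror's Stata implementation; the function name resolve and its arguments are hypothetical.

```python
from typing import Optional

# Package defaults as stated on this page.
PACKAGE_DEFAULTS = {"min_cell_size": 50, "quantile_trim": 1, "min_resid_df": 10}


def resolve(name: str,
            init_option: Optional[int] = None,
            persistent_config: Optional[int] = None) -> int:
    """Hypothetical helper: init option > persistent config > package default."""
    # Compare against None rather than truthiness so an explicit 0
    # (e.g. quantile_trim(0)) still counts as an override.
    if init_option is not None:
        return init_option
    if persistent_config is not None:
        return persistent_config
    return PACKAGE_DEFAULTS[name]


# Session option wins over the persistent config.
print(resolve("min_cell_size", init_option=20, persistent_config=30))  # 20
# No session option: the persistent config applies.
print(resolve("min_cell_size", persistent_config=30))  # 30
# Neither set: fall back to the package default.
print(resolve("min_resid_df"))  # 10
```

Testing against None matters for quantile_trim, where 0 is a meaningful setting rather than "unset".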

Thresholds by agency

Threshold | Safety level | Use case | Agencies
50 | Maximum | Public data release, multi-agency compliance | All (strictest)
20 | Standard | Synthetic microdata for research | Statistics Sweden (SCB), Eurostat microdata
10 | Minimum | Internal use within a secure environment | US Census Bureau, UK ONS, Eurostat tables
5 | Insufficient | Not recommended | Below most microdata standards

Alignment per agency (all verified against public guidance):

Agency | Tables | Microdata | Notes
US Census Bureau | 10 | 10 | Standard for public-use microdata
UK ONS | 10 | 10 | Standard for all public releases
Statistics Sweden (SCB) | 5 | 20 | Stricter for microdata
Eurostat | 5 | 20 | Stricter for microdata
Statistics Canada | 5 | 10 | Context-dependent

The default of 50 is deliberately above all of these, so that checkpoint output is acceptable to any of these agencies under its strictest policy. Agencies comfortable with a lower threshold can relax it through min_cell_size.

What the threshold protects

Three areas of the checkpoint output apply the DM_MIN_CELL_SIZE test at write-time:

1. Overall categorical marginals

File: <checkpoint_dir>/marginals_cat.csv

Categories with fewer than DM_MIN_CELL_SIZE observations are suppressed (not written). Example with default threshold 50:

  • Category rare_disease=Yes with n=35 → suppressed
  • Category rare_disease=No with n=7,465,964 → included
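
The suppression rule for overall marginals can be sketched in a few lines. This Python snippet only illustrates the rule as described (drop categories with fewer than min_cell_size observations and count how many were dropped); the function name and the toy data are hypothetical and it is not the package's code.

```python
from collections import Counter


def publishable_marginals(values, min_cell_size=50):
    """Return (kept category counts, number of suppressed categories)."""
    counts = Counter(values)
    # A category is written only if it has at least min_cell_size observations.
    kept = {cat: n for cat, n in counts.items() if n >= min_cell_size}
    return kept, len(counts) - len(kept)


# Toy data: a common category and a rare one with n=35.
values = ["No"] * 1000 + ["Yes"] * 35
kept, n_suppressed = publishable_marginals(values)
print(kept)          # {'No': 1000}
print(n_suppressed)  # 1
```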

2. Stratified categorical marginals

File: <checkpoint_dir>/marginals_cat_stratified.csv

Within each stratum, categories with fewer than DM_MIN_CELL_SIZE observations are suppressed. Example with stratification by region:

  • region=A, occupation=rare_job, n=15 → suppressed
  • region=A, occupation=teacher, n=45,230 → included

3. Stratified correlations

File: <checkpoint_dir>/correlations_stratified.csv

Entire strata with fewer than DM_MIN_CELL_SIZE observations are skipped. Example:

  • rural_island=1 with n=12 → stratum skipped entirely
  • urban_city=1 with n=2,456,123 → correlations computed
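
Unlike the per-category rule, this check works on whole strata. A minimal Python sketch under that reading: group records by stratum and keep only strata with at least min_cell_size observations, so that correlations would be computed for the kept strata alone. The names strata_to_process and the toy records are hypothetical.

```python
from collections import defaultdict


def strata_to_process(rows, min_cell_size=50):
    """rows: iterable of (stratum, record). Returns (kept strata, number skipped)."""
    by_stratum = defaultdict(list)
    for stratum, record in rows:
        by_stratum[stratum].append(record)
    # A stratum below the threshold is skipped entirely, not partially.
    kept = {s: recs for s, recs in by_stratum.items() if len(recs) >= min_cell_size}
    return kept, len(by_stratum) - len(kept)


rows = ([("urban_city", (i, 2 * i)) for i in range(500)]
        + [("rural_island", (i, i)) for i in range(12)])
kept, n_skipped = strata_to_process(rows)
print(sorted(kept))  # ['urban_city']
print(n_skipped)     # 1
```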

Continuous variables — tail-coding the extremes

Continuous-variable marginals are stored as a 101-point quantile distribution in marginals_cont.csv. The interior quantiles (p1 through p99) are population-level statistics with no disclosure risk: the 1st percentile of a variable with millions of observations is not a single individual's value. The raw extremes (q0 = minimum, q100 = maximum) are a different matter — the oldest person in a region or the highest earner in a parish is a single-observation statistic, and a recipient reads it straight off min()/max() of the synthetic variable.

The quantile_trim control closes this channel by top/bottom-coding those extremes. At the default of 1, q0 is populated with the 1st percentile and q100 with the 99th percentile, so the synthetic support contracts to [p01, p99] without distorting the interior distribution. It resolves the same way as the other parameters:

  • Per session: datamirror init, ... quantile_trim(5)
  • Persistent: registream config, dm_quantile_trim(5)
  • Package default: 1 in _dm_utils.ado.

Set it to 0 only when the data custodian has already tail-coded the source data upstream. Larger values (e.g. 5 for [p05, p95]) give stronger SDC at the cost of a narrower support. Under stratification, the same trim applies within each stratum, and strata below min_cell_size are omitted from marginals_cont_stratified.csv. The resolved value is recorded in metadata.csv as dm_quantile_trim.
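
As an illustration of tail-coding, the Python sketch below builds a 101-point quantile vector and clamps it to [p_trim, p_(100-trim)]. Two assumptions are made here that the text does not specify: quantiles use linear interpolation, and for trims larger than 1 every quantile outside the retained range is clamped to the boundary. The function names are hypothetical.

```python
def quantile(sorted_x, p):
    """Linearly interpolated quantile of a sorted list, p in [0, 100] (assumed method)."""
    pos = (len(sorted_x) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (sorted_x[hi] - sorted_x[lo]) * (pos - lo)


def tail_coded_quantiles(x, quantile_trim=1):
    """Return q0..q100 with the extremes top/bottom-coded."""
    xs = sorted(x)
    q = [quantile(xs, p) for p in range(101)]
    if quantile_trim > 0:
        lower, upper = q[quantile_trim], q[100 - quantile_trim]
        # Clamp: interior quantiles are unchanged, only the tails move.
        q = [min(max(v, lower), upper) for v in q]
    return q


x = list(range(1001))            # toy variable with values 0..1000
q = tail_coded_quantiles(x)      # default trim of 1
print(q[0], q[50], q[100])       # 10.0 500.0 990.0
raw = tail_coded_quantiles(x, quantile_trim=0)
print(raw[0], raw[100])          # 0.0 1000.0
```

With the default trim the raw minimum and maximum (0 and 1000 in the toy data) never appear in the stored vector, while the median is untouched.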

Correlation matrices — disclosure-risk assessment

Releasing correlation matrices is low-risk and standard practice for statistical agencies. datamirror exports a full pairwise Spearman rank correlation matrix (correlations.csv) without per-cell suppression. Spearman is the correct input for the Gaussian copula because it is invariant to monotone transforms of the marginals, and like Pearson it is an aggregate over the full sample with the same low disclosure risk.

Aggregate statistics, not individual data

A rank correlation is a population-level summary, computed over every observation:

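\rho(X, Y) = \frac{\sum_{i=1}^{n} (r_i - \bar{r})(s_i - \bar{s})}{n \, \sigma_r \, \sigma_s}
  where r_i and s_i are the ranks of x_i and y_i, and \sigma_r, \sigma_s are the population standard deviations of the ranks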
  • Each correlation averages over all n observations.
  • No individual's data can be recovered from a correlation coefficient.
  • Correlation matrices are routinely reported alongside regression analyses.
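
The formula can be checked numerically with a short Python sketch. It assumes average ranks for ties and population (divide-by-n) standard deviations, which makes the expression above equal to the Pearson correlation of the ranks. It is an illustration only, not the routine datamirror calls.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as sum((r - r_bar)(s - s_bar)) / (n * sd_r * sd_s)."""
    r, s = average_ranks(x), average_ranks(y)
    n = len(x)
    r_bar, s_bar = sum(r) / n, sum(s) / n
    num = sum((ri - r_bar) * (si - s_bar) for ri, si in zip(r, s))
    sd_r = sqrt(sum((ri - r_bar) ** 2 for ri in r) / n)
    sd_s = sqrt(sum((si - s_bar) ** 2 for si in s) / n)
    return num / (n * sd_r * sd_s)


x = [10, 20, 30, 40, 50]
cubed = [v ** 3 for v in x]                  # monotone transform of x
print(round(spearman(x, cubed), 6))          # 1.0
print(round(spearman(x, x[::-1]), 6))        # -1.0
```

The first result illustrates the invariance to monotone transforms: cubing x changes its values but not its ranks.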

Small groups have negligible influence

With large n (e.g. n = 7.5M), a small group's influence on the overall correlation is roughly its share of the sample:

Small group size | Influence on overall correlation
n=1 | ~0.0000001
n=10 | ~0.000001
n=49 | ~0.000007
n=100 | ~0.00001

Even a perfectly correlated small subgroup (r=1.0 internally) contributes orders of magnitude below the noise floor of a full-sample correlation. Correlation matrices are reliably non-disclosive at realistic register-data sample sizes.
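
The magnitudes in the table can be reproduced with simple arithmetic. The sketch below reads "influence" as the subgroup's share of observations, k / n; that reading is an assumption, chosen because it matches the tabulated orders of magnitude for n = 7.5M.

```python
def sample_share(group_size, total_n):
    """Share of the full sample contributed by a subgroup (assumed influence proxy)."""
    return group_size / total_n


TOTAL_N = 7_500_000
for k in (1, 10, 49, 100):
    # Printed in scientific notation for easy comparison with the table.
    print(f"group n={k:>3}: share = {sample_share(k, TOTAL_N):.1e}")
```

A group of 49, the largest that the default min_cell_size would suppress elsewhere, accounts for well under one hundred-thousandth of such a sample.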

What's in the correlation file

  • Continuous variables — all included
  • Low-cardinality numeric categoricals (household counts, age groups, …) — included
  • High-cardinality numeric categoricals — included
  • String categoricals — automatically excluded (not numerically meaningful for a rank correlation)

Checkpoint coefficients

A normal regression coefficient is an aggregate summary statistic: it is computed over every observation in the estimation sample, and no individual's data is recoverable from it. But a checkpoint stores coefficients, standard errors, the command line, and N, and marginal suppression does not touch any of these. A regression run on very few people, or one whose coefficient is dominated by a tiny group, can functionally encode the underlying records — at the extreme, an OLS line through two points reproduces both points. So checkpoints are not assumed safe; each one passes a disclosure control at capture time, the only point where the estimation sample still exists to count cells.

The control runs across every supported estimator family (OLS, fixed effects, 2SLS, logit, probit, Poisson, negative binomial) and enforces three rules, governed by the same parameters as the marginals:

  1. Sample-size floor. A checkpoint with e(N) < min_cell_size is rejected outright — the whole regression is too small to describe. Nothing is stored.
  2. Residual-df floor. A checkpoint with fewer than min_resid_df residual degrees of freedom (N minus the number of estimated parameters) is rejected: a near-exact fit is close to an algebraic function of the rows and can encode the data. This is the standard safe-centre output-checking criterion. For ML estimators that don't expose e(df_r), the floor is computed from N minus the coefficient count.
  3. Per-category coefficient drop. A factor level or interaction cell describing fewer than min_cell_size people yields a coefficient dominated by that small group, so that single coefficient is dropped from the exported recipe while the rest of the regression is kept. The variable still exists in the synthetic data (it is generated from its marginal), so the regression still reproduces — only the small-group coefficient is withheld. If dropping leaves no real predictor, the whole checkpoint is rejected.

Continuous predictors are never dropped — a slope estimated over many rows is not a small-cell statistic; only categorical levels and interaction cells are. The number of coefficients omitted is recorded in metadata.csv as n_coef_suppressed. The residual-df floor is the third privacy parameter, min_resid_df (default 10), resolved with the same precedence as min_cell_size and quantile_trim.
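
The three rules can be summarised in a small decision function. This Python sketch is a simplified reading of the text, not the package's implementation: residual df is approximated as N minus the coefficient count, a "real predictor" is taken to mean any term other than the constant (_cons), and the Coef class, the function name, and the order of the checks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Coef:
    name: str
    is_categorical: bool = False  # factor level or interaction cell
    cell_n: int = 0               # observations behind that level or cell


def check_checkpoint(n_obs, coefs, min_cell_size=50, min_resid_df=10):
    # Rule 1: sample-size floor.
    if n_obs < min_cell_size:
        return {"status": "rejected", "reason": "N below min_cell_size"}
    # Rule 2: residual-df floor (N minus coefficient count, an approximation).
    if n_obs - len(coefs) < min_resid_df:
        return {"status": "rejected", "reason": "residual df below min_resid_df"}
    # Rule 3: drop small-cell categorical coefficients; continuous terms stay.
    kept = [c for c in coefs
            if not (c.is_categorical and c.cell_n < min_cell_size)]
    if not any(c.name != "_cons" for c in kept):
        return {"status": "rejected", "reason": "no predictor left after suppression"}
    return {"status": "kept",
            "coefficients": [c.name for c in kept],
            "n_coef_suppressed": len(coefs) - len(kept)}


coefs = [Coef("age"), Coef("region_A", True, 400),
         Coef("region_B", True, 30), Coef("_cons")]
print(check_checkpoint(1000, coefs))
# {'status': 'kept', 'coefficients': ['age', 'region_A', '_cons'], 'n_coef_suppressed': 1}
print(check_checkpoint(40, coefs)["reason"])  # N below min_cell_size
```

In the first call only region_B (n=30) is withheld; the continuous predictor age is kept regardless of any cell count.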

Fixed effects absorbed by reghdfe are not in e(b), so they are not gated by the per-category check directly; over-absorption is caught by the residual-df floor instead.

Verifying compliance

Data stewards reviewing a checkpoint directory before approving a file-out transfer can verify:

  1. Which thresholds were used? Open metadata.csv and read dm_min_cell_size, dm_quantile_trim, and dm_min_resid_df. Those are the three policies applied to every pass of this extract.
  2. How many cells were suppressed? metadata.csv records n_cat_categories / n_cat_suppressed for the overall categorical marginals, and n_cat_categories_strat / n_cat_suppressed_strat for the stratified pass.
  3. Any strata dropped? metadata.csv records n_strata_skipped_corr (stratified correlations) and n_strata_skipped_cont (stratified continuous marginals): the number of strata skipped in each pass because they fell below min_cell_size.
  4. Any checkpoint coefficients dropped? metadata.csv records n_coef_suppressed: the number of small-group coefficients withheld by the capture-time checkpoint disclosure control.
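
A reviewer could automate the first of these checks. The Python sketch below assumes, purely for illustration, that metadata.csv is a two-column file with a key,value header; the actual layout is not specified on this page, so adapt the reader to the real file. The sample content and function names are made up.

```python
import csv
import io


def read_metadata(handle):
    """Read a key,value CSV into a dict (assumed layout, not confirmed)."""
    return {row["key"]: row["value"] for row in csv.DictReader(handle)}


def policy_mismatches(meta, expected):
    """Return the policy keys whose recorded value differs from the expected one."""
    return [k for k, v in expected.items() if meta.get(k) != str(v)]


# Made-up stand-in for a metadata.csv file.
sample = io.StringIO(
    "key,value\n"
    "dm_min_cell_size,50\n"
    "dm_quantile_trim,1\n"
    "dm_min_resid_df,10\n"
)
meta = read_metadata(sample)
expected = {"dm_min_cell_size": 50, "dm_quantile_trim": 1, "dm_min_resid_df": 10}
print(policy_mismatches(meta, expected))                                # []
print(policy_mismatches(meta, {**expected, "dm_min_cell_size": 20}))    # ['dm_min_cell_size']
```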

One honest limitation: when exactly one cell of a categorical is suppressed, its count can be derived from the published residual (N in metadata.csv minus the sum of the published frequencies). This reveals a count, never an identity, and only within a stratum that is itself at least min_cell_size. Accurate reproduction is datamirror's purpose, so the final context-aware judgment on whether a residual count is sensitive belongs to the data center's own output check, which these controls are designed to conform to, not replace.
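
The residual derivation is plain subtraction, shown here with made-up numbers. The sketch applies only when exactly one cell of the categorical is suppressed; with two or more suppressed cells the residual gives their combined count, not any single one.

```python
def residual_count(total_n, published_counts):
    """Count implied for the single suppressed cell: N minus the published frequencies."""
    return total_n - sum(published_counts)


# Toy example: N = 1000, two published categories, one suppressed.
print(residual_count(1000, [600, 365]))  # 35
```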

For a strict review workflow:

* Set the agreed threshold once, persisted across sessions (20 here, which relaxes the default of 50):
registream config, dm_min_cell_size(20)

* Or override for just this extract:
datamirror init, checkpoint_dir("ckpt") replace min_cell_size(20)
datamirror extract, replace

* Then inspect metadata.csv for dm_min_cell_size and the suppression counts
* to confirm the extract matched the policy you intended.

See also

  • datamirror overview
  • Stata reference — privacy section summarizes this page
  • Fidelity tiers — the four-layer architecture and what each layer does
  • datamirror/docs/PRIVACY.md in the datamirror repo — annotated reference describing each suppression site