Synthetic microdata and tables – An investigation of metrics
Simon Kolb* 1, Andreas Tang1
Abstract
Synthetic data has begun to show potential as an alternative to traditional statistical disclosure control (SDC) methods in specific use cases. Since data synthesis predominantly happens at the microdata level, the development of utility and risk metrics has also focused on this domain. Statistical agencies, on the other hand, mostly limit data publication to aggregates, selecting various subsets of variables for cross-tabulation. Traditional SDC methods such as cell suppression tend to operate at the tabular level, which makes detailed knowledge of the published data product paramount. When synthetic microdata is generated instead, post-tabular adjustments are no longer required. However, since tabular and microdata metrics can differ significantly, we investigate the relationship between the two.
Using a large real-life data set as an example for data synthesis, we show that certain global metrics may disproportionately represent small subsets of variables, making them inappropriate estimators of the quality of aggregates. On the other hand, we show strong similarities between certain microdata-level risk metrics and the risk of group disclosure in aggregated data.