Identification risk on microdata sets
Valentina Wolff Lirio1, Rita de Sousa* 2, Susana Faria1
Abstract
With the growing demand for data and statistical information, it is essential to implement confidentiality control methodologies. Data privacy concerns generally arise from legal requirements to protect individual confidentiality; the aim is to release as much information as possible without compromising either its quality or the privacy of the statistical units. Much of the data collected on statistical units is individual-level information, known as microdata. Statistical institutions face the major challenge of guaranteeing the confidentiality of statistical units when disseminating detailed microdata sets (Benschop et al., 2021).
Robust Statistical Disclosure Control (SDC) methodologies, and a balance between data utility and the preservation of privacy, are very important (Templ and Sariyar, 2022), especially in an era of large data volumes and legal restrictions such as those established by the General Data Protection Regulation (GDPR) of the European Union.
This article provides an overview of the main concepts and of the different anonymization methodologies, and emphasizes the importance of assessing both the risk of identification and the loss of information. The main identification risk assessment measures for categorical and numerical variables are presented. Particular attention is paid to the risk of identification in longitudinal data, with proposed methodologies aimed at improving privacy protection.
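As an illustration of the kind of risk measure used for categorical variables, the sketch below counts how often each combination of quasi-identifier values occurs in a sample and flags records whose combination occurs fewer than k times (a violation of k-anonymity). It is a minimal example under stated assumptions, not the procedure applied in the case study: the function names, the toy records and the variables `region`, `sector` and `size` are all hypothetical.

```python
from collections import Counter


def key_frequencies(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def records_at_risk(records, quasi_identifiers, k=3):
    """Return the indices of records whose key combination occurs fewer than k times."""
    freq = key_frequencies(records, quasi_identifiers)
    return [
        i
        for i, r in enumerate(records)
        if freq[tuple(r[q] for q in quasi_identifiers)] < k
    ]


# Hypothetical toy microdata: each dict is one statistical unit.
records = [
    {"region": "North", "sector": "Retail", "size": "Small"},
    {"region": "North", "sector": "Retail", "size": "Small"},
    {"region": "North", "sector": "Retail", "size": "Small"},
    {"region": "South", "sector": "Retail", "size": "Small"},
    {"region": "South", "sector": "Energy", "size": "Large"},
]
keys = ["region", "sector", "size"]

# Records 3 and 4 each have a unique key combination, so they violate 3-anonymity.
at_risk = records_at_risk(records, keys, k=3)
```

In longitudinal data the same unit appears in several periods, so the quasi-identifier key would typically need to span the unit's values across periods; the sketch above treats a single cross-section only.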
The application of these concepts in real-world scenarios is discussed, highlighting ongoing research efforts to address the privacy challenges of making individual microdata available as a panel (Li et al., 2023). Through a case study involving a financial database, specifically the microdata base of the Credit Responsibility Center (CCR) of the Bank of Portugal (BdP), these methodologies are applied in practice using R (Templ, 2017).
References
Benschop, T., Machingauta, C., & Welch, M. (2021). Statistical disclosure control: A practice guide. The World Bank.
Li, S., Schneider, M. J., Yu, Y., & Gupta, S. (2023). Reidentification risk in panel data: Protecting for k-anonymity. Information Systems Research, 34(3), pp. 1066-1088.
Templ, M., & Sariyar, M. (2022). A systematic overview on methods to protect sensitive data provided for various analyses. International Journal of Information Security, 21(6), pp. 1233-1246.
Templ, M. (2017). Statistical disclosure control for microdata. Springer.
*: Speaker
1: University of Minho, Portugal
2: Bank of Portugal