Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 1.2 Data Validation and Imputation

"There can be only one" - Deduplicating personal records in census data using exact matching and Machine Learning techniques

Eszter Milibak*1, Flora Samu1

Abstract

In 2022, the Hungarian Central Statistical Office (HCSO) completed its regular decennial census enumeration. Duplicate records tend to be present in census data. HCSO's primary deduplication efforts have focused on developing a variety of exact matching methods for finding duplicated personal records. The present framework consists of 12 algorithms that use natural identifiers (name, gender, date of birth, address) and technical identifiers in different versions and combinations.
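
As an illustration of what one such exact matching rule might look like, the following minimal Python sketch groups records that agree on a chosen combination of identifiers after light normalization. The records, field names, rule names and normalization steps are all hypothetical assumptions made for this example; they do not describe HCSO's 12 algorithms.

```python
from collections import defaultdict

# Hypothetical records; field names are illustrative, not HCSO's schema.
records = [
    {"id": 1, "name": "Kovács Anna", "gender": "F", "dob": "1980-05-17",
     "address": "Budapest, Fő utca 1."},
    {"id": 2, "name": "KOVÁCS  ANNA", "gender": "F", "dob": "1980-05-17",
     "address": "Budapest, Fő u. 1"},
    {"id": 3, "name": "Nagy Béla", "gender": "M", "dob": "1975-11-02",
     "address": "Szeged, Kossuth tér 3."},
]


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not block an exact match."""
    return " ".join(str(value).lower().split())


def exact_match_groups(records, key_fields):
    """Return lists of record ids that agree on every normalized key field."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(normalize(rec[field]) for field in key_fields)
        groups[key].append(rec["id"])
    # Only groups with more than one record are duplicate candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Each rule is one combination of identifiers (assumed for illustration).
rules = {
    "name_gender_dob": ("name", "gender", "dob"),
    "name_dob_address": ("name", "dob", "address"),
}

duplicates_by_rule = {
    rule: exact_match_groups(records, fields) for rule, fields in rules.items()
}
```

In this toy data the first rule pairs records 1 and 2, while the second rule finds nothing because the two address spellings differ. Gaps of this kind are what a similarity-based approach is meant to cover.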

Besides exact matching rules, HCSO has also been experimenting with Machine Learning based solutions. In this approach, we use the deduplication functionality of the Python package Dedupe, in which pairs of personal records are classified as duplicates or distinct records by Regularized Logistic Regression models. Based on edit distances between the identifier variables in scope, the model outputs a duplicate flag together with the associated probability. Model performance is improved by active user labelling of record pairs whose initial classification is uncertain. Model parameterization can be tuned by specifying the relative importance of the precision and recall metrics. Finally, the model assigns a unique identifier to the duplicate records of each person, based on hierarchical clustering with centroid linkage.
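
The pipeline described above can be sketched in plain Python under strong simplifying assumptions. This is not Dedupe's API or its internals. The weights and bias below are hypothetical stand-ins for coefficients that a regularized logistic regression would learn from labelled pairs. The final step uses simple connected components (union-find) in place of hierarchical clustering with centroid linkage, purely to keep the example self-contained. Record values are invented.

```python
import math
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical coefficients; a real model would estimate these from labels.
WEIGHTS = {"name": -8.0, "dob": -10.0, "address": -3.0}
BIAS = 4.0


def duplicate_probability(r1, r2):
    """Logistic function of the weighted per-field distances."""
    z = BIAS + sum(w * normalized_distance(r1[f], r2[f])
                   for f, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def cluster(records, threshold=0.5):
    """Assign one cluster id per record by linking pairs scored at or above
    the threshold (connected components, not centroid linkage)."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if duplicate_probability(records[a], records[b]) >= threshold:
            parent[find(b)] = find(a)
    return {rid: find(rid) for rid in records}


# Invented records keyed by record id.
records = {
    1: {"name": "kovacs anna", "dob": "1980-05-17", "address": "budapest fo utca 1"},
    2: {"name": "kovacs ana", "dob": "1980-05-17", "address": "budapest fo u 1"},
    3: {"name": "nagy bela", "dob": "1975-11-02", "address": "szeged kossuth ter 3"},
}

clusters = cluster(records)
```

With these assumed weights, records 1 and 2 receive a high duplicate probability and share a cluster identifier, while record 3 remains on its own.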

We plan to outline possible directions of further development for both HCSO’s exact matching and Machine Learning deduplication processes. On the one hand, the manual checking effort that follows exact matching might be reduced by concentrating on the cases to which the Machine Learning models assign lower confidence scores. On the other hand, examining the weights that the Dedupe training process assigns to the edit distances of the identifier variables may suggest new exact matching algorithms to implement. Furthermore, the Machine Learning processes used to deduplicate Hungarian census records might be enhanced by widening the range of identifier variables based on domain knowledge, and by experimenting with alternative edit distance and classification functions. The costs and optimal use cases of both methodologies on census data are presently being assessed.
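
One way the idea of concentrating manual checks on low-confidence cases could be organised is sketched below. The probability band (0.2 to 0.8), the bucket names and the scored pairs are hypothetical choices for illustration only, not values used or proposed by HCSO.

```python
def triage(scored_pairs, lower=0.2, upper=0.8):
    """Split scored record pairs into three buckets by model confidence.

    Pairs at or above `upper` are treated as duplicates, pairs at or below
    `lower` as distinct, and everything in between goes to manual review.
    """
    buckets = {"auto_duplicate": [], "manual_review": [], "auto_distinct": []}
    for pair, probability in scored_pairs.items():
        if probability >= upper:
            buckets["auto_duplicate"].append(pair)
        elif probability <= lower:
            buckets["auto_distinct"].append(pair)
        else:
            buckets["manual_review"].append(pair)
    return buckets


# Invented duplicate probabilities for pairs of record ids.
scored_pairs = {(1, 2): 0.94, (1, 3): 0.05, (2, 4): 0.55}

buckets = triage(scored_pairs)
```

Narrowing or widening the band trades manual workload against the risk of automated errors, so in practice the bounds would need to be chosen against labelled data.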

*: Speaker

1: Hungarian Central Statistical Office