Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 1.2 Data Validation and Imputation

Applying k-NN utilizing similarity measures for categorical data for anomaly detection in the German Federal Employment Agency’s statistics

Hinnerk Müller*¹, Daniel Lechmann¹

Abstract

The Federal Employment Agency's statistical office needs to monitor the validity of its data to ensure the correctness of published statistics. To this end, we test whether the k-NN algorithm can assist in detecting anomalies in the data underlying official statistics. Specifically, we search for anomalies in individual-level unemployment records using the (average) distance of each observation to its k nearest neighbors. As most features in the data are measured on a nominal scale, we use similarity measures designed for categorical data to compute the distance between observations. Because computational requirements are high for this use case, we restrict the analysis to a subset of observations, namely new entries into unemployment in a specific month, to achieve acceptable computation time. Additionally, we develop an explanatory approach for the anomalies we discover when using the Overlap similarity measure. This approach calculates the fraction that each feature contributes to an observation's distance to its neighborhood.
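The following is a minimal sketch of the idea described in the abstract, not the authors' implementation. It assumes that the Overlap distance between two records is the share of features on which they differ, that the anomaly score is the mean distance to the k nearest neighbors, and that a feature's contribution is its mismatch rate against those neighbors divided by the sum of mismatch rates over all features. The function names, the toy records, and the three columns (say, region, sector, status) are hypothetical.

```python
import numpy as np


def overlap_distance_matrix(X):
    """Pairwise Overlap distance: share of features on which two rows differ."""
    X = np.asarray(X)
    # mismatches[i, j, f] is True where rows i and j differ on feature f.
    mismatches = X[:, None, :] != X[None, :, :]
    return mismatches.mean(axis=2)


def knn_anomaly_scores(X, k):
    """Mean Overlap distance of every row to its k nearest neighbors."""
    dist = overlap_distance_matrix(X)
    np.fill_diagonal(dist, np.inf)  # a row must not be its own neighbor
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]
    scores = np.take_along_axis(dist, neighbors, axis=1).mean(axis=1)
    return scores, neighbors


def feature_contributions(X, i, neighbor_idx):
    """Fraction of row i's neighborhood distance attributable to each feature.

    Assumption: contribution = per-feature mismatch rate against the
    neighbors, normalized so that the fractions sum to 1.
    """
    X = np.asarray(X)
    mismatch_rate = (X[neighbor_idx] != X[i]).mean(axis=0)
    total = mismatch_rate.sum()
    if total == 0:  # row is identical to all of its neighbors
        return np.zeros_like(mismatch_rate)
    return mismatch_rate / total


# Hypothetical toy records with three nominal features.
records = np.array([
    ["A", "x", "p"],
    ["A", "x", "p"],
    ["A", "y", "p"],
    ["A", "y", "p"],
    ["A", "x", "p"],
    ["B", "x", "q"],  # differs from the bulk on the first and third feature
])

scores, neighbors = knn_anomaly_scores(records, k=2)
most_anomalous = int(np.argmax(scores))
contributions = feature_contributions(records, most_anomalous,
                                      neighbors[most_anomalous])
print(most_anomalous, scores[most_anomalous], contributions)
```

In this sketch the full pairwise comparison needs memory and time on the order of n² times the number of features, which illustrates why restricting the analysis to a subset of records, as the abstract describes, can be necessary in practice.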

*: Speaker

1: Federal Employment Agency - Germany