Applying k-NN utilizing similarity measures for categorical data for anomaly detection in the German Federal Employment Agency’s statistics
Hinnerk Müller* 1, Daniel Lechmann1
Abstract
The Federal Employment Agency's statistical office must monitor the validity of its data to ensure that published statistics are correct. To this end, we test whether the k-NN algorithm can help detect anomalies in the data underlying official statistics. Specifically, we search for anomalies in individual-level unemployment records using the (average) distance from each observation to its k nearest neighbors. Because most features in the data are measured on a nominal scale, we compute distances between observations with similarity measures designed for categorical data. Computational requirements for this use case are high, so we restrict the analysis to a subset of observations, namely new entries into unemployment in a specific month, to achieve acceptable computation time. Additionally, we develop an approach for explaining the anomalies discovered with the Overlap similarity measure: it calculates the fraction that each feature contributes to an observation’s distance to its neighborhood.
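As a rough illustration of the procedure summarized in the abstract, the following Python sketch scores each record by its average Overlap distance to its k nearest neighbors and then splits that distance into per-feature shares. It is a minimal sketch under stated assumptions, not the authors' implementation: the Overlap distance is taken to be the fraction of features on which two records differ, a feature's contribution is taken to be its share of all mismatches between a record and its k neighbors, and the function names and toy records are hypothetical. The pairwise comparison needs memory proportional to n² · d, which is consistent with the abstract's remark that computational requirements are high.

```python
import numpy as np


def overlap_mismatches(X):
    """Return a boolean array m[i, j, f] that is True when rows i and j differ on feature f."""
    X = np.asarray(X)
    return X[:, None, :] != X[None, :, :]


def knn_overlap_scores(X, k):
    """Average Overlap distance to the k nearest neighbors, plus per-feature shares.

    Assumption: Overlap distance = fraction of mismatching features.
    Assumption: a feature's contribution = its share of all mismatches
    between an observation and its k nearest neighbors.
    """
    mismatches = overlap_mismatches(X)
    dist = mismatches.mean(axis=2)          # pairwise Overlap distance in [0, 1]
    np.fill_diagonal(dist, np.inf)          # an observation is not its own neighbor

    # Indices of the k nearest neighbors (stable sort keeps ties in row order).
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]
    scores = np.take_along_axis(dist, neighbors, axis=1).mean(axis=1)

    # Count, per feature, how often each observation disagrees with its neighbors.
    rows = np.arange(dist.shape[0])[:, None]
    per_feature = mismatches[rows, neighbors, :].sum(axis=1).astype(float)
    totals = per_feature.sum(axis=1, keepdims=True)
    contributions = np.divide(
        per_feature, totals, out=np.zeros_like(per_feature), where=totals > 0
    )
    return scores, contributions


# Hypothetical toy records with three nominal features.
X_demo = [
    ["A", "x", "p"],
    ["A", "x", "p"],
    ["A", "x", "q"],
    ["A", "x", "p"],
    ["B", "y", "p"],
]
scores, contributions = knn_overlap_scores(X_demo, k=2)
most_anomalous = int(np.argmax(scores))
print(most_anomalous, scores[most_anomalous], contributions[most_anomalous])
```

In this toy data the last record differs from its two nearest neighbors on the first two features only, so those two features share its distance equally and the third contributes nothing.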