Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 2.1 Text Classification and Language Models

Automatic text classification for the German household budget survey

Jerome Olsen* 1, Ariane Lestrade* 1, Bogdan Levagin 1


Household budget surveys (HBS) provide vital data for assessing the income situation and identifying the consumption patterns of the population as a whole and of its various groups. In these surveys, participants record their daily purchases over a three-month period using either a digital app or a paper diary. All recorded purchases must be categorized according to the COICOP classification system. Entries from the app are pre-classified by the users themselves via a search algorithm, while entries from paper diaries require subsequent processing and, so far, manual classification. For the 2023 edition of the German HBS, the Einkommens- und Verbrauchsstichprobe (EVS), which is Destatis' largest voluntary household survey, we anticipate the need to classify approximately 5 million purchase entries. To manage this volume, we have developed and implemented a machine learning procedure that is currently being used in production. However, model training with the latest implementation is reaching the limits of the current infrastructure, and large amounts of data are expected to require classification on a weekly basis. Hence, there is an ongoing project for a potential migration to a big data environment (Cloudera), which is equipped with the necessary frameworks. Our presentation will outline the development process and the ongoing transition, detail the final model specifications, and share insights from the procedure's practical application, including challenges encountered and lessons learned.

*: Speaker

1: Federal Statistical Office - Germany