Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 3.1 From Text to Code

An overview of STATEC's projects on automatic coding

Yu-Lin Huang*1, Adrien Cannet-Delbosq1, Marie Walzer1, Laurent Maretti1, Claude Lamboray*1, Maxime Cordy2, Yves Le Traons2

Abstract

Over the past few years, STATEC has conducted several projects to automatically code text labels to statistical classifications using supervised machine learning methods.

  • In the Consumer Price Index, scanner data include product labels that have to be mapped to COICOP.
  • In the Household Budget Survey, the descriptions of purchased products provided by survey respondents have to be mapped to the food categories of COICOP.
  • In the EU-SILC and Household Budget Surveys, respondents are asked to describe their professional occupation and employer. These replies are then mapped to ISCO and NACE.
  • In the survey on economic activities, respondents describe the activity of the legal unit, which has to be mapped to NACE.

Each of these projects was developed in response to a specific business need. While they share some similarities and common challenges, the projects differ in the size and characteristics of the input data, the number of categories in the target classification, the classification method, the performance of the classifier, the type of prediction (e.g. a specific class versus a list of classes), the re-training process, the human-machine interaction and the deployed tools. In this presentation, we provide an overview of these approaches and highlight lessons learned for implementing such projects in the context of official statistics.
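
The difference between predicting a specific class and predicting a list of classes can be illustrated with a minimal sketch. It is not the method used in any of the STATEC projects: it is a toy multinomial naive Bayes classifier written with the Python standard library only, and the training labels, the class codes (A, B, C) and all function names are hypothetical placeholders chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training pairs (text label, class code). The codes are
# placeholders, not entries of COICOP, ISCO or NACE.
TRAIN = [
    ("whole milk 1l", "A"),
    ("skimmed milk 500ml", "A"),
    ("white bread sliced", "B"),
    ("rye bread loaf", "B"),
    ("apple juice 1l", "C"),
    ("orange juice 500ml", "C"),
]


def train(pairs):
    """Count tokens per class for a multinomial naive Bayes model."""
    token_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, code in pairs:
        class_counts[code] += 1
        token_counts[code].update(text.lower().split())
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, class_counts, vocab


def rank_classes(model, text):
    """Return (code, probability) pairs, most likely class first."""
    token_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for code, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        denom = sum(token_counts[code].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are ignored
                # Laplace (add-one) smoothed token likelihood
                score += math.log((token_counts[code][tok] + 1) / denom)
        log_scores[code] = score
    # Turn log scores into probabilities that sum to one.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    ranked = [(c, w / norm) for c, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def predict_one(model, text):
    """Prediction type 1: a single, specific class."""
    return rank_classes(model, text)[0][0]


def predict_top_k(model, text, k=2):
    """Prediction type 2: a short list of candidate classes."""
    return [code for code, _ in rank_classes(model, text)[:k]]


model = train(TRAIN)
print(predict_one(model, "milk 1l"))
print(predict_top_k(model, "milk 1l", k=2))
```

In a setting like this, a single predicted class would suit fully automatic coding, whereas a ranked list of candidates could be shown to a human coder who makes the final choice. Which option fits a given project is a design decision that this sketch does not settle.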

*: Speaker

1: STATEC - Luxembourg

2: University of Luxembourg