An overview of STATEC's projects on automatic coding
Yu-Lin Huang*1, Adrien Cannet-Delbosq1, Marie Walzer1, Laurent Maretti1, Claude Lamboray*1, Maxime Cordy2, Yves Le Traons2
Abstract
Over the last few years, STATEC has conducted several projects to automatically code text labels to a statistical classification using supervised machine learning methods.
- In the Consumer Price Index, scanner data include product labels that have to be mapped to COICOP.
- In the Household Budget Survey, the descriptions of the purchased products provided by survey respondents have to be mapped to food categories of COICOP.
- In the EU-SILC and Household Budget surveys, respondents are asked to describe their occupation and their employer. These replies are then mapped to ISCO and NACE.
- In the survey on economic activities, respondents describe the activity of the legal unit, which has to be mapped to NACE.
Each of these projects was developed in response to a specific business need. While they share some similarities and common challenges, they differ in the size and characteristics of the input data, the number of categories in the target classification, the classification method, the performance of the classifier, the type of prediction (e.g. a single class versus a list of candidate classes), the re-training process, the human-machine interaction and the deployed tools. In this presentation, we provide an overview of these approaches and highlight some lessons learned for implementing such projects in the context of official statistics.
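To make the distinction between predicting a single class and returning a list of candidate classes concrete, the following is a minimal sketch of a supervised text coder. It is not one of the STATEC classifiers: the method (naive Bayes over character trigrams), the class `NaiveBayesCoder`, the toy training labels and the category names (`bread_cereals`, `milk_cheese_eggs`, `fruit`) are all assumptions made for illustration and are not actual COICOP codes.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a label into overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesCoder:
    """Hypothetical multinomial naive Bayes coder with add-one smoothing."""

    def fit(self, labels, categories):
        self.class_counts = Counter(categories)
        self.feature_counts = defaultdict(Counter)
        for label, category in zip(labels, categories):
            self.feature_counts[category].update(char_ngrams(label))
        self.vocabulary = {g for c in self.feature_counts.values() for g in c}
        return self

    def log_scores(self, label):
        """Return an unnormalised log-probability for every category."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        # N-grams never seen in training carry no information and are skipped.
        grams = [g for g in char_ngrams(label) if g in self.vocabulary]
        scores = {}
        for category, count in self.class_counts.items():
            feats = self.feature_counts[category]
            denom = sum(feats.values()) + vocab_size
            score = math.log(count / total)
            for gram in grams:
                score += math.log((feats[gram] + 1) / denom)
            scores[category] = score
        return scores

    def predict(self, label):
        """Single-class prediction, suited to fully automatic coding."""
        scores = self.log_scores(label)
        return max(scores, key=scores.get)

    def predict_top_k(self, label, k=2):
        """Ranked list of candidates, suited to review by a human coder."""
        scores = self.log_scores(label)
        return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy training data, invented for this example.
TRAIN = [
    ("whole wheat bread 500g", "bread_cereals"),
    ("baguette", "bread_cereals"),
    ("corn flakes cereal", "bread_cereals"),
    ("semi-skimmed milk 1l", "milk_cheese_eggs"),
    ("cheddar cheese", "milk_cheese_eggs"),
    ("free range eggs x6", "milk_cheese_eggs"),
    ("red apples 1kg", "fruit"),
    ("bananas", "fruit"),
    ("seedless grapes", "fruit"),
]

coder = NaiveBayesCoder().fit(*zip(*TRAIN))
best = coder.predict("wheat bread")
candidates = coder.predict_top_k("wheat bread", k=2)
```

Here `predict` would feed a process without manual intervention, whereas `predict_top_k` would support a workflow in which a human coder chooses among the suggested categories. In practice, the choice between the two depends on classifier performance and on how much manual review the business process can absorb.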