Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 2.1 Text Classification and Language Models

Use of a large language model to derive the economic sector of businesses from unstructured text on economic activities

Gerald Heß*1, Bahraminejad Tahmores1


The German Federal Employment Agency processes and stores, in its information system, the economic activity of each employing business in Germany as a 5-digit code of the national Classification of Economic Activities (WZ 2008, based on NACE). The companies report their economic sector themselves as part of the online application procedure for the allocation of a company registration number.

Quality assurance of this self-reported information is necessary, but it is time-consuming and ties up considerable resources.

The approach presented here examines whether the economic sector can be derived automatically from unstructured text describing a business's economic activities. For this purpose, the text is processed by a BERT language model fine-tuned specifically for this task, which recommends economic sectors based on this input.
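
The abstract does not say how the model output is turned into recommendations. One plausible reading, shown here only as a minimal sketch, is that the fine-tuned classifier emits one score (logit) per WZ 2008 code and the highest-scoring codes are offered to the reviewer. The function name `recommend_sectors`, the top-k cutoff, the placeholder codes, and the example scores are all assumptions made for illustration and are not taken from the authors' system.

```python
import math


def recommend_sectors(logits, codes, k=3):
    """Rank candidate sector codes by softmax probability and keep the top k.

    `logits` are raw classifier scores, one per entry in `codes`.
    In a real pipeline they would come from the fine-tuned model;
    here they are supplied by hand (assumption for illustration).
    """
    if len(logits) != len(codes):
        raise ValueError("logits and codes must have the same length")

    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Sort (code, probability) pairs from most to least probable.
    ranked = sorted(zip(codes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder codes and scores, NOT real WZ 2008 entries or model output.
example_codes = ["11.11.1", "22.22.2", "33.33.3", "44.44.4"]
example_logits = [0.2, 2.5, -1.0, 1.1]

recommendations = recommend_sectors(example_logits, example_codes, k=2)
for code, prob in recommendations:
    print(f"{code}: {prob:.2f}")
```

Returning several ranked candidates rather than a single code fits the wording "recommended": a reviewer can confirm or override the suggestion, which is where the time saving in quality assurance would come from.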

*: Speaker

1: Federal Employment Agency - Germany