Use of a large language model to derive the economic sector of businesses from unstructured text on economic activities
Gerald Heß* 1, Bahraminejad Tahmores1
Abstract
The German Federal Employment Agency records the economic activity of every employing business in Germany in its information system, coded as a 5-digit class of the national Classification of Economic Activities (WZ 2008, based on NACE). Businesses report their economic sector themselves when they apply online for a company registration number.
The required quality assurance of this information is time-consuming and ties up a considerable amount of resources.
The approach presented here examines whether the economic sector can be derived automatically from unstructured text describing a business's economic activities. To this end, the text is fed to a BERT language model fine-tuned specifically for this task, which then recommends matching economic sectors.
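The recommendation step could look roughly like the following minimal sketch. It assumes the fine-tuned classifier emits one raw score (logit) per WZ 2008 class, and it converts these scores into a ranked shortlist of candidate sectors. The function name `recommend_sectors`, the example codes, and the score values are illustrative assumptions and are not taken from the system described here.

```python
import math


def recommend_sectors(logits, codes, k=3):
    """Return the k most probable codes with their softmax probabilities.

    `logits` are hypothetical per-class scores from a text classifier;
    `codes` are the class labels in the same order.
    """
    if len(logits) != len(codes):
        raise ValueError("logits and codes must have the same length")

    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Sort classes by probability, highest first, and keep the top k.
    ranked = sorted(zip(codes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Illustrative 5-digit codes and made-up scores for a single text.
codes = ["47.11.1", "56.10.1", "62.01.1", "86.21.0"]
logits = [0.2, 3.1, 1.4, -0.5]

recommendations = recommend_sectors(logits, codes, k=2)
for code, prob in recommendations:
    print(f"{code}: {prob:.2f}")
```

Returning a short ranked list rather than a single code fits the idea of recommending sectors, since a reviewer can then confirm or correct the top suggestion.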