15. Wissenschaftliche Tagung am 20. und 21. Juni 2024

Datenerhebung, Datenqualität und Datenethik in Zeiten von künstlicher Intelligenz

Improving Occupational Coding in the Mikrozensus: A Comparative Analysis of CATI and CAWI Data in Machine Learning Applications

Bernhard Hochstetter1, Olga Kononykhina2


One might think that relying on official statistical data as training data guarantees success: the data is collected in a standardized manner, passes the same quality checks, and exists in large volumes. The reality, however, is more nuanced. SODA LMU and the Statistisches Landesamt Baden-Württemberg have been working together to improve occupational coding in the Mikrozensus. The Mikrozensus is a legally mandated survey in which, once selected, respondents answer the same questions four times within two or four years. In Baden-Württemberg, the data is collected mainly through Computer Assisted Web Interview (CAWI) and Computer Assisted Telephone Interview (CATI) modes. Regardless of the mode, all answers are checked in the office, and coders are expected to reach out and clarify any vague information before it becomes an official data point.

When it comes to occupational coding in CATI interviews, coders enter a respondent's job title as free text. They are then presented with a lengthy list of suggestions and must quickly select the most appropriate one. The same lookup tool is used in the CAWI mode, which poses a new challenge for respondents answering online and on their own: they may lack the experience or motivation to search for and select the correct occupation.

LMU's previous research on occupational coding has shown that machine learning (ML) can be successfully used for occupational classification, improving user experience by limiting the number of lookup options to just five.
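The five-option shortlist can be produced by any probabilistic text classifier. The following is a minimal sketch of that idea, not the authors' actual model: a TF-IDF character n-gram classifier that returns the five occupation codes with the highest predicted probability for a free-text job title. The toy job titles, codes, and the `top_five_codes` helper are all hypothetical.

```python
# Illustrative sketch only (not the model used in the project): rank
# occupation codes by predicted probability and keep the top five.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def top_five_codes(model, job_title):
    """Return the five occupation codes with the highest predicted probability."""
    probs = model.predict_proba([job_title])[0]
    order = np.argsort(probs)[::-1][:5]
    return [model.classes_[i] for i in order]

# Hypothetical toy training pairs: (job title, occupation code).
titles = ["Bäcker", "Bäckerin", "Softwareentwickler", "Programmierer",
          "Krankenpfleger", "Pflegekraft", "Lehrer", "Grundschullehrerin",
          "Verkäufer", "Einzelhandelskaufmann"]
codes = ["2921", "2921", "4341", "4341", "8131", "8131",
         "8411", "8411", "6211", "6211"]

# Character n-grams cope well with compounding and inflection in job titles.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(titles, codes)
suggestions = top_five_codes(model, "Bäckermeister")
```

A respondent (or coder) would then pick from `suggestions` instead of scrolling through the full classification.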

Our joint project investigates three main questions. First, which mode of data collection (CATI or CAWI) provides more reliable training data for the ML algorithm? Second, how does the volume of training data affect performance? Third, are there significant discrepancies between occupations collected through CATI and those collected through CAWI? In particular, we assess whether training data from CATI interviews can reliably classify jobs from CAWI interviews, and vice versa.

To do this, we used 126,000 occupations collected via CAWI and 34,000 collected via CATI in 2022 and 2023. Our findings indicate that CATI data is more reliable than CAWI data when equal amounts of CATI and CAWI data are used for model training (31,000 each). However, when the full volume of CAWI data (130,000) is used for training, the model tested on CAWI data performs better. We also observed that prediction quality decreases when modes are mixed, for example when CATI data is used for training and CAWI data for testing.
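The cross-mode comparison above can be sketched as a simple train-on-one-mode, test-on-the-other evaluation. This is an assumed illustration of the design, not the project's actual pipeline; the `cross_mode_accuracy` helper, the choice of classifier, and the toy data are all hypothetical.

```python
# Hypothetical sketch of the cross-mode evaluation design: fit a
# classifier on job titles from one collection mode, score it on titles
# from the other mode, and compare the four train/test combinations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def cross_mode_accuracy(train_titles, train_codes, test_titles, test_codes):
    """Train on one mode's (title, code) pairs and report accuracy on the other's."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_titles, train_codes)
    return accuracy_score(test_codes, model.predict(test_titles))

# In the study, equal-sized samples per mode (31,000 each) would be drawn,
# e.g.:
#   acc = cross_mode_accuracy(cati_titles, cati_codes, cawi_titles, cawi_codes)
```

Running this in both directions (CATI→CAWI and CAWI→CATI), and at different training volumes, yields the performance grid the abstract describes.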

In our presentation, we will provide a detailed analysis of how model performance varies by mode and volume of training data. We will present case studies illustrating which jobs are easy to classify, which jobs receive different classifications in different modes, and where the algorithm underperforms. We will also discuss the next steps for our project.

1: Statistisches Landesamt Baden-Württemberg

2: LMU München