Events: 15th Scientific Conference, 20 and 21 June 2024

Data collection, data quality, and data ethics in times of artificial intelligence

Large Language Models and Natural Language Processing (NLP) in the social sciences: A validation framework for measuring social science constructs

Lukas Birkenmaier¹, Prof. Dr. Claudia Wagner², Dr. Clemens Lechner¹

Abstract

Guidance on how to validate computational text-based measures of social science constructs is fragmented. While social science scholars generally acknowledge the importance of validating their text-based measures, they often lack common terminology and a unified framework for doing so. This paper introduces ValiTex, a new validation framework designed to assist scholars in validly measuring social science constructs based on textual data. ValiTex asks researchers to demonstrate three types of validity evidence: substantive evidence (outlining the theoretical underpinning of the measure), structural evidence (examining the properties of the text model and its output), and external evidence (testing how the measure relates to independent information). In addition to the framework, ValiTex offers practical guidance through a checklist that is adaptable to different use cases. The checklist defines specific validation steps and assesses how important each step is for establishing validity.
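As an illustration of what external evidence can look like in practice, the following minimal sketch compares the output of a text-based measure with independent human annotations using Cohen's kappa. It is not part of ValiTex itself: the function name `cohens_kappa`, the label coding, and the example data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Share of items on which both sources assign the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each source's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources use one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical example data (assumption): 1 = sexist, 0 = not sexist.
model_labels = [1, 0, 1, 1, 0, 0, 1, 0]  # output of the text-based measure
human_labels = [1, 0, 0, 1, 0, 0, 1, 1]  # independent human annotations

kappa = cohens_kappa(model_labels, human_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```

Agreement with human annotations is only one possible source of external evidence; which comparisons are appropriate depends on the construct and the use case.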

We demonstrate the usefulness of the framework by applying it to a common use case in research on society: the detection of sexism in social media data. This application shows how ValiTex can be used in an area of great social significance, helping to identify and analyze subtle and explicit forms of sexism in online discourse. This is particularly relevant for politicians, researchers, and official statisticians, as it provides insights into social trends and public opinions.

In conclusion, ValiTex represents a crucial step towards a more rigorous and systematic methodology in social science research. It addresses an urgent need for standardized validation procedures in an era in which text-based data is increasingly important. With ValiTex, researchers, politicians, and official statisticians can conduct more reliable and meaningful analyses, ultimately leading to better-informed and better-substantiated public policy and social discourse.

1: GESIS – Leibniz Institute for the Social Sciences, Mannheim, Germany

2: GESIS – Leibniz Institute for the Social Sciences, Mannheim; RWTH Aachen University, Aachen, Germany; Complexity Science Hub Vienna, Vienna, Austria