15th Scientific Conference, 20 and 21 June 2024

Data Collection, Data Quality, and Data Ethics in the Age of Artificial Intelligence

Evaluating the risk of bots in web surveys recruited through social media

Saijal Shahania1, Joshua Claaßen2, Jan Karem Höhne2, David Broneske1


Web surveys are a predominant data collection method in the social sciences. Their outcomes are key for political and social decision-making, including official statistics. The integrity and quality of data collected through web surveys is therefore of utmost importance. However, as web surveys struggle with low response rates, researchers are exploiting new sources for respondent recruitment. This especially includes social media platforms, which offer sophisticated advertising and targeting systems. Although social media recruitment provides quick access to a large respondent pool, the integrity and data quality of such web surveys are threatened by bots. Bots can potentially shift survey outcomes and thus political and social decisions. This is alarming since bots have already been used to manipulate public opinion, for example during the Brexit referendum in 2016. The consequences of bots for web surveys are severe: 1) Bot-based responses may differ from human-based responses, introducing measurement error. 2) Bots can undermine public trust in social research and its outcomes. 3) Bots can complete web surveys at high speed, causing financial damage. While there is ample literature on how bots infiltrate social media platforms, distribute fake news, and skew public opinion, research on the consequences of bots for web surveys and respondent recruitment through social media is scarce. In this study, we investigate the prevalence of bots in web surveys recruited through social media and their impact on data quality. In January 2024, we conduct a web survey based on social media recruitment that includes a bot detection experiment with three conditions: 1) no bot detection method (control group), 2) a challenge-response method (CAPTCHA group), and 3) an email authentication method (authentication group). The web survey focuses on LGBTQ-related question topics, and we collect paradata with the open-source “Embedded Client Side Paradata (ECSP)” tool.
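The three-group experimental design above could be implemented with deterministic random assignment, for example as in the following minimal sketch (the function, seed, and condition names are our own illustration, not taken from the study):

```python
import random

# The three experimental groups of the bot detection experiment
CONDITIONS = ["control", "captcha", "authentication"]

def assign_condition(respondent_id: str, seed: int = 2024) -> str:
    """Deterministically assign a respondent to one experimental condition.

    Seeding the generator with the respondent ID makes the assignment
    reproducible: the same respondent always lands in the same group.
    """
    rng = random.Random(f"{seed}-{respondent_id}")
    return rng.choice(CONDITIONS)

# Example: the assignment is stable across repeated calls
print(assign_condition("resp-001"))
```

Seeding per respondent (rather than drawing from one shared generator) keeps the assignment stable even if respondents re-enter the survey or the process restarts.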
We target a German sample of 1,500 respondents recruited via Facebook. Of these, 150 respondents receive a 5€ incentive (lottery approach). To detect bots, we analyze response behavior across the experimental conditions. This includes analyses of textual responses (e.g., narrations), non-textual responses (e.g., ratings), and non-substantive responses (e.g., “don’t know”). This is accompanied by analyses of paradata, such as response times and User-Agent strings. We flag suspicious units as bots following predefined rules, such as extremely fast responding and responding to so-called “honeypot” questions (invisible questions that only bots perceive). We also analyze test scores from CAPTCHAs and human interaction challenges, helping to validate respondent authentication. Machine learning algorithms are then used in an unsupervised setting (no ground truth available) to analyze whether bots stand out from human response behavior. We extract features from non-textual responses (e.g., non-differentiation) and textual responses (e.g., type-token ratio and part-of-speech structure) and employ NLP techniques to detect patterns of robotic language or unusual content. Our goal is to infer features that allow us to determine bot activity and its consequences for data quality. Our study thus helps researchers and practitioners to collect high-quality, trustworthy data from web surveys recruited through social media.
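The rule-based flags and the type-token ratio feature described above could look as follows (a minimal illustration with hypothetical field names and thresholds, not the study’s actual pipeline):

```python
import re

def flag_bot(record, min_seconds=60.0):
    """Apply predefined rules: extremely fast responding and honeypot answers.

    `record` is a hypothetical per-respondent dict; the 60-second threshold
    is an assumption for illustration only.
    """
    flags = []
    if record["response_time_s"] < min_seconds:
        flags.append("too_fast")
    if record.get("honeypot", "") != "":  # humans never see this question
        flags.append("honeypot_answered")
    return flags

def type_token_ratio(text):
    """Lexical diversity of an open-ended answer: unique tokens / total tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

record = {"response_time_s": 12.4, "honeypot": "yes",
          "narration": "good good good good survey"}
print(flag_bot(record))                    # ['too_fast', 'honeypot_answered']
print(type_token_ratio(record["narration"]))  # 0.4 (2 unique of 5 tokens)
```

Low type-token ratios in open narrations, combined with rule-based flags like these, are the kind of features an unsupervised detector can use to separate bot-like from human response behavior.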

1: DZHW, University of Magdeburg

2: DZHW, Leibniz University Hannover