15th Scientific Conference, 20 and 21 June 2024

Data Collection, Data Quality, and Data Ethics in the Age of Artificial Intelligence

Effects of Training Data Collection Methods: Evidence of Annotation Sensitivity

Jacob Beck, Bolei Ma, Stephanie Eckman, Christoph Kern, Frauke Kreuter

Social Data Science and AI Lab (SODA) | Department of Statistics | LMU Munich


Machine learning algorithms require training data, usually annotated with ground-truth labels by human annotators. While crowdsourced training data make large-scale model training possible, it is crucial to recognize that data annotation is a human-driven rather than a purely statistical process, and this human label variation has often been neglected [Plank, 2022]. Annotations are sensitive to annotator characteristics [Al Kuwatly et al., 2020], to the annotation materials, and to the instructions given to annotators [Parmar et al., 2023].

Our work proposes that the design and wording of the data collection instrument also affect data quality and the models trained on the resulting annotations. We introduce the term Annotation Sensitivity to refer to the impact of annotation data collection methods on the annotations themselves and on downstream model performance and predictions.

Drawing on a pre-annotated tweet corpus for hate speech detection [Davidson et al., 2017], we sampled 3,000 tweets and had 900 annotators label them for hate speech and offensive language under five experimental conditions of the annotation instrument. The tweets were randomly grouped into ordered batches of 50, and each annotator annotated one batch in one condition. To the best of our knowledge, we are the first to collect data with such a design, and the resulting dataset can serve as a valuable resource for annotation sensitivity research.1
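The batching and assignment logic of such a design can be sketched as follows. This is a minimal illustration, not the study's actual code; the balanced assignment scheme (every batch appearing in every condition, with annotators spread evenly over the batch-condition pairs) is an assumption for the sketch:

```python
import random

def make_batches(tweet_ids, batch_size=50, seed=0):
    """Randomly shuffle tweets and split them into ordered batches."""
    rng = random.Random(seed)
    ids = list(tweet_ids)
    rng.shuffle(ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

def assign_annotators(n_annotators, n_batches, n_conditions, seed=0):
    """Give each annotator one (batch, condition) pair, repeating the
    pool of pairs so that every pair is covered roughly equally often."""
    rng = random.Random(seed)
    pairs = [(b, c) for c in range(n_conditions) for b in range(n_batches)]
    reps = -(-n_annotators // len(pairs))  # ceiling division
    pool = pairs * reps
    rng.shuffle(pool)
    return pool[:n_annotators]

batches = make_batches(range(3000), batch_size=50)      # 60 batches of 50 tweets
assignments = assign_annotators(900, len(batches), 5)   # one pair per annotator
```

With 60 batches and 5 conditions there are 300 batch-condition pairs, so 900 annotators yield three independent annotators per pair under this scheme.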

To investigate the effect of the annotation condition on model training and predictions, we fine-tune BERT [Devlin et al., 2019] models on the annotations from each of the five conditions. We find considerable differences between the conditions in model performance, model predictions, and model learning curves [Kern et al., 2023].

Because human annotators are susceptible to cognitive biases, the order in which annotation items are presented is likely to affect the annotations. Our recent analyses reveal, across all five experimental conditions, a negative correlation between a tweet's position within its batch and its likelihood of being annotated as hate speech or offensive language [Beck et al., 2024]: tweets presented later in a batch are less likely to receive either label.

Our findings confirm the existence of the annotation sensitivity effect, demonstrating that training data collection methods can affect model performance and data quality.

We illustrate the importance of incorporating social science theories about how people respond to questions and form judgments into the collection of high-quality training data, and we call for greater attention to training data collection methods in the era of data-centric AI.


[Al Kuwatly et al., 2020] Al Kuwatly, H., Wich, M., and Groh, G. (2020). Identifying and measuring annotator bias based on annotators’ demographic characteristics. In Akiwowo, S., Vidgen, B., Prabhakaran, V., and Waseem, Z., editors, Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184–190, Online. Association for Computational Linguistics.

[Beck et al., 2024] Beck, J., Eckman, S., Ma, B., Chew, R., and Kreuter, F. (2024). Order effects in annotation tasks: Further evidence of annotation sensitivity. In Proceedings of the First Workshop on Uncertainty-Aware NLP, Malta. Association for Computational Linguistics.

[Davidson et al., 2017] Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. Proceedings of the International AAAI Conference on Web and Social Media, 11(1):512–515.

[Devlin et al., 2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

[Kern et al., 2023] Kern, C., Eckman, S., Beck, J., Chew, R., Ma, B., and Kreuter, F. (2023). Annotation sensitivity: Training data collection methods affect model performance. In Bouamor, H., Pino, J., and Bali, K., editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14874–14886, Singapore. Association for Computational Linguistics.

[Parmar et al., 2023] Parmar, M., Mishra, S., Geva, M., and Baral, C. (2023). Don’t blame the annotator: Bias already starts in the annotation instructions. In Vlachos, A. and Augenstein, I., editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1779–1789, Dubrovnik, Croatia. Association for Computational Linguistics.

[Plank, 2022] Plank, B. (2022). The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Goldberg, Y., Kozareva, Z., and Zhang, Y., editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

1: We made our dataset available on Hugging Face: