Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April 2024

Session 4.3 Methodology II

Effects of Training Data Collection Methods: Evidence of Annotation Sensitivity

Bolei Ma* 1, Jacob Beck1, Stephanie Eckman2, Christoph Kern1, Frauke Kreuter1


Machine learning algorithms require training data, usually annotated with ground-truth labels by human annotators. While crowd-sourced training data are essential for model training, it is crucial to recognize that data annotation is a human-driven rather than a purely statistical process: annotations are sensitive to annotator characteristics, to the annotation materials, and to the instructions given to annotators. We introduce the term Annotation Sensitivity and propose that the design and wording of the data collection instrument also affect data quality and the resulting model. Drawing on a pre-annotated tweet corpus for hate speech detection, we collected annotations of hate speech and offensive language for 3,000 tweets from 900 annotators under five experimental conditions. The tweets were randomly grouped into ordered batches of 50, and each annotator annotated one batch under one condition. To investigate how the annotation condition affects model training and predictions, we fine-tuned BERT models on the data from each of the five conditions and found considerable differences between the conditions in model performance, model predictions, and model learning curves. Because human annotators are susceptible to cognitive biases, the order in which annotation objects are presented is also likely to influence the annotations. Further analyses reveal a negative correlation between a tweet's position within its batch and its likelihood of being annotated as hate speech or offensive language across all five experimental conditions: tweets presented later are less likely to receive either label. Our findings confirm the existence of annotation sensitivity, demonstrating that training data collection methods affect data quality and model performance. They illustrate the importance of incorporating social science theories about how people respond to questions and form judgments into the collection of high-quality training data.
We call for attention to training data collection methods in the era of data-centric AI. (Our dataset is available on Hugging Face.)
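The experimental design described above — 3,000 tweets in ordered batches of 50, with each of 900 annotators labeling exactly one batch under one of five conditions — can be sketched as follows. The condition labels, tweet IDs, and the uniformly random batch/condition allocation are illustrative assumptions; the abstract does not specify the actual allocation scheme.

```python
# Minimal sketch of the data collection design, assuming placeholder IDs
# and a random allocation of annotators to batches and conditions.
import random

random.seed(42)

tweets = [f"tweet_{i}" for i in range(3000)]                 # placeholder tweet IDs
batches = [tweets[i:i + 50] for i in range(0, len(tweets), 50)]  # 60 ordered batches of 50

conditions = [f"condition_{c}" for c in range(1, 6)]         # hypothetical condition names
annotators = [f"annotator_{a}" for a in range(900)]

# Each annotator gets one batch and one condition (random here; the actual
# allocation scheme is not described in the abstract).
assignments = [
    {"annotator": a,
     "batch": random.randrange(len(batches)),
     "condition": random.choice(conditions)}
    for a in annotators
]
```

With this design, every tweet has a fixed position (1 to 50) within its batch, which is what makes the order-effect analysis in the abstract possible.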
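The reported order effect — a negative correlation between a tweet's position within its batch and its chance of being labeled as hate speech or offensive language — can be illustrated with simulated data. The decline rate and noise level below are arbitrary assumptions for the sketch, not the study's estimates.

```python
# Illustrative order-effect check on simulated data: correlate within-batch
# position with a (simulated) share of annotators applying a label.
import random

random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

positions = list(range(1, 51))  # position within a 50-tweet batch
# Simulated label rates that decline with position (assumed slope and noise).
label_rates = [0.30 - 0.002 * p + random.gauss(0, 0.01) for p in positions]

r = pearson(positions, label_rates)  # negative under this simulation
```

In the study, the analogous correlation is computed separately for the hate speech and offensive language labels and is negative in all five conditions.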

*: Speaker

1: LMU Munich - Germany

2: University of Maryland - United States of America