
Corpus Considerations for Annotator Modeling and Scaling

Olufunke O. Sarumi, Béla Neuendorf, Joan Plepi, Lucie Flek, Jörg Schlötterer, Charles Welch

TL;DR

This work addresses the challenge of capturing diverse annotator perspectives in subjective NLP tasks by systematically evaluating annotator-modeling methods across seven binary-label corpora. It compares User Token, Composite Embedding, Composite Embedding with User Token, Multi-task, and Personalization approaches, using macro-F1 as the evaluation metric. The key finding is that simple, scalable methods, particularly the user-token embedding, perform best when annotator agreement is low, while the novel composite embedding offers improvements when agreement is high; multi-task models often underperform and are more computationally expensive. The most robust predictor of performance is the number of annotations per annotator, a result that can guide future corpus construction and perspectivist NLP research. The authors provide open-source code and extensive trial statistics to support further exploration of annotator modeling and corpus design.
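Since macro-F1 is the metric used throughout, a minimal sketch of how it can be computed for binary labels may help. The function names and the zero-score convention for empty classes are illustrative assumptions, not taken from the authors' released code.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Assumed convention: a class with no true or predicted instances scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    # Toy labels from one hypothetical annotator versus model predictions.
    print(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because each class contributes equally to the mean, a model that ignores a minority label is penalized even when that label is rare, which is why macro-F1 suits the setting described here.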

Abstract

Recent trends in natural language processing research and annotation tasks affirm a paradigm shift from the traditional reliance on a single ground truth to a focus on individual perspectives, particularly in subjective tasks. In scenarios where annotation tasks are meant to encompass diversity, models that rely solely on the majority class labels may inadvertently disregard valuable minority perspectives. This oversight could result in the omission of crucial information and, in a broader context, risk disrupting the balance within larger ecosystems. As the landscape of annotator modeling unfolds with diverse representation techniques, it becomes imperative to investigate their effectiveness in light of the fine-grained features of the datasets. This study systematically explores various annotator modeling techniques and compares their performance across seven corpora. From our findings, we show that the commonly used user token model consistently outperforms more complex models. We introduce a composite embedding approach and show distinct differences in which model performs best as a function of the annotator agreement within a given dataset. Our findings shed light on the relationship between corpus statistics and annotator modeling performance, which informs future work on corpus construction and perspectivist NLP.

Paper Structure

This paper contains 16 sections, 2 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: GoEmotions mean performance across emotions when scaling the number of annotators. The SBERT baseline is indicated by the dashed line. Shaded regions correspond to 95% confidence intervals.
  • Figure 2: Relative performance increase in F1 as a function of the number of annotations per annotator.