Leveraging Annotator Disagreement for Text Classification

Jin Xu, Mariët Theune, Daniel Braun

TL;DR

Three strategies for leveraging annotator disagreement in text classification are proposed and compared: a probability-based multi-label method, an ensemble system, and instruction tuning. In hate speech detection, the multi-label method outperforms the other two approaches, while in abusive conversation detection, instruction tuning achieves the best performance.

Abstract

It is common practice in text classification to use only a single majority label for model training, even if a dataset has been annotated by multiple annotators. Doing so can remove valuable nuances and diverse perspectives inherent in the annotators' assessments. This paper proposes and compares three different strategies to leverage annotator disagreement for text classification: a probability-based multi-label method, an ensemble system, and instruction tuning. All three approaches are evaluated on the tasks of hate speech and abusive conversation detection, which inherently entail a high degree of subjectivity. Moreover, to evaluate the effectiveness of embracing annotation disagreements for model training, we conduct an online survey that compares the performance of the multi-label model against a baseline model trained with the majority label. The results show that in hate speech detection, the multi-label method outperforms the other two approaches, while in abusive conversation detection, instruction tuning achieves the best performance. The results of the survey also show that the outputs from the multi-label models are considered a better representation of the texts than those from the single-label model.
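
To make the contrast between a majority label and a disagreement-aware label concrete, here is a minimal sketch. It is not the authors' implementation: the label names, the example votes, and the helper functions `majority_label` and `soft_label` are hypothetical, and it assumes that a probability-based target can be approximated by the fraction of annotators who chose each class.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def soft_label(votes, classes):
    """Return the fraction of annotators who chose each class, in class order."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[c] / total for c in classes]


# Hypothetical annotations for one text from five annotators.
votes = ["hate", "not_hate", "hate", "hate", "not_hate"]
classes = ["not_hate", "hate"]

print(majority_label(votes))       # single training label, disagreement discarded
print(soft_label(votes, classes))  # per-class proportions, disagreement retained
```

In this sketch the majority label hides the fact that two of five annotators disagreed, whereas the per-class proportions keep that information available as a training target.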

Paper Structure (22 sections, 8 figures, 2 tables)


Figures (8)

  • Figure 1: The framework of model training within the probability-based multi-label method.
  • Figure 2: Fine-tuning BERT individually as sub-models within the ensemble system.
  • Figure 3: Fine-tuning LLaMa 2 as a sub-model with instruction tuning in the hate speech dataset.
  • Figure 4: Fine-tuning LLaMa 2 as a sub-model with instruction tuning in the abusive conversation dataset.
  • Figure 5: Comparison of the ensemble system’s performances on two tasks.
  • ...and 3 more figures