
Don't Blame the Data, Blame the Model: Understanding Noise and Bias When Learning from Subjective Annotations

Abhishek Anand, Negar Mokhberian, Prathyusha Naresh Kumar, Anweasha Saha, Zihao He, Ashwin Rao, Fred Morstatter, Kristina Lerman

TL;DR

This work shows that models that are only provided aggregated labels show low confidence on high-disagreement data instances, and investigates classifying using Multiple Ground Truth (Multi-GT) approaches, inspired by recent studies demonstrating the effectiveness of learning from raw annotations.

Abstract

Researchers have raised awareness about the harms of aggregating labels, especially in subjective tasks that naturally contain disagreements among human annotators. In this work, we show that models that are only provided aggregated labels show low confidence on high-disagreement data instances. While previous studies consider such instances as mislabeled, we argue that the reason the high-disagreement text instances have been hard to learn is that conventional aggregated models underperform in extracting useful signals from subjective tasks. Inspired by recent studies demonstrating the effectiveness of learning from raw annotations, we investigate classifying using Multiple Ground Truth (Multi-GT) approaches. Our experiments show an improvement in confidence for the high-disagreement instances.

Paper Structure (23 sections, 6 figures, 6 tables)

This paper contains 23 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Dataset Cartography map for the Single-GT model on $\mathcal{D}_{\textsl{MDA}}$ (left), $\mathcal{D}_{\textsl{SI}}$ (center) and $\mathcal{D}_{\textsl{MHS}}$ (right). The x-axis shows variability and the y-axis shows confidence. The points are color-graded by correctness (the probability the trained model assigns to the ground-truth label in its prediction). Samples in the top-left corner, with high confidence and low variability, are easy for the model to learn, whereas samples in the lower-left corner, with low confidence and low variability, are difficult.
  • Figure 2: Boxplots illustrating the relationship between model confidence and annotator agreement level $(a_m)$ for the Single-GT model trained on $\mathcal{D}_{\textsl{MDA}}$ (left), $\mathcal{D}_{\textsl{SI}}$ (center) and $\mathcal{D}_{\textsl{MHS}}$ (right). There is a clear correlation between the model's confidence in predicting the ground-truth label and the agreement between annotators (shown on the x-axis as the fraction of annotators that agree with the majority vote). Significant differences in the confidence distribution across agreement levels are marked using the Mann-Whitney-Wilcoxon test [mcknight2010mann] with Statannotations [florian_charlier_2022_7213391]; **** denotes $p \leq 1.00 \times 10^{-4}$.
  • Figure 3: Boxplots illustrating the relationship between model confidence and whether the annotator's annotation ($y_{n,m}$) disagrees with the majority vote ($\bar{{y}}_{.,m}$) for DisCo trained on $\mathcal{D}_{\textsl{MDA}}$ (left), $\mathcal{D}_{\textsl{SI}}$ (center) and $\mathcal{D}_{\textsl{MHS}}$ (right). There is a clear correlation: the model is more confident in its predicted label when $y_{n,m} = \bar{{y}}_{.,m}$ and less confident when $y_{n,m} \neq \bar{{y}}_{.,m}$. Significant differences in the confidence distribution between $y_{n,m} = \bar{{y}}_{.,m}$ and $y_{n,m} \neq \bar{{y}}_{.,m}$ are marked using the Mann-Whitney-Wilcoxon test [mcknight2010mann] with Statannotations [florian_charlier_2022_7213391]; **** denotes $p \leq 1.00 \times 10^{-4}$.
  • Figure 4: Boxplots illustrating the relationship between model confidence and whether the annotator's annotation ($y_{n,m}$) disagrees with the majority vote ($\bar{{y}}_{.,m}$) for DisCo trained on $\mathcal{D}_{\textsl{MDA}}$ (left), $\mathcal{D}_{\textsl{SI}}$ (center) and $\mathcal{D}_{\textsl{MHS}}$ (right), restricted to the subset of samples where the Single-GT model's confidence is below 0.5. In contrast to the overall dataset presented in Figure 3, the trend is reversed: confidence is higher when $y_{n,m} \neq \bar{{y}}_{.,m}$ and lower when $y_{n,m} = \bar{{y}}_{.,m}$. This highlights DisCo's ability to learn from the minority votes that are discarded by the Single-GT model. Significant differences in the confidence distribution between $y_{n,m} = \bar{{y}}_{.,m}$ and $y_{n,m} \neq \bar{{y}}_{.,m}$ are marked using the Mann-Whitney-Wilcoxon test [mcknight2010mann] with Statannotations [florian_charlier_2022_7213391]; **** denotes $p \leq 1.00 \times 10^{-4}$.
  • Figure 5: Number of samples with disagreement (annotator agreement level strictly below 1.0), grouped by the number of different labels that DisCo learns with high confidence (above 0.5) for each sample, for the datasets $\mathcal{D}_{\textsl{MDA}}$ (left), $\mathcal{D}_{\textsl{SI}}$ (center) and $\mathcal{D}_{\textsl{MHS}}$ (right).
  • ...and 1 more figure
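
The quantities plotted in Figures 1 and 2 can be illustrated with a minimal sketch. It assumes the usual Dataset Cartography definitions (confidence is the mean probability assigned to the ground-truth label across training epochs, and variability is the standard deviation of that probability), which may differ in detail from what the paper computes. The agreement level follows the Figure 2 caption: the fraction of annotators whose label matches the majority vote. The function names and input layout are hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def cartography_stats(gold_probs_per_epoch):
    """Return (confidence, variability) for one sample.

    gold_probs_per_epoch holds the probability the model assigned to the
    ground-truth label at the end of each training epoch (assumed layout).
    """
    confidence = mean(gold_probs_per_epoch)     # mean over epochs
    variability = pstdev(gold_probs_per_epoch)  # population std over epochs
    return confidence, variability


def agreement_level(annotations):
    """Return (majority_label, fraction of annotators who chose it).

    Ties are broken by whichever label appears first in the input,
    which is an arbitrary choice made for this sketch.
    """
    counts = Counter(annotations)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(annotations)


# A sample the model consistently gets right: high confidence, low variability.
easy_conf, easy_var = cartography_stats([0.9, 0.9, 0.9])

# A sample the model consistently gets wrong: low confidence, low variability.
hard_conf, hard_var = cartography_stats([0.1, 0.1, 0.1])

# Four annotators, three of whom agree on label 1.
label, a_m = agreement_level([1, 1, 0, 1])
print(label, a_m)
```

Under these assumptions, the first sample would land in the top-left region of a cartography map and the second in the lower-left region, while the annotation example has an agreement level of 0.75.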