Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin

TL;DR

Methods for identifying diverging preferences are developed to mitigate their influence on LLM evaluation and training, and to support the development of pluralistically aligned LLMs.

Abstract

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes -- task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements are at odds with standard reward modeling approaches, which are designed with the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. We also find that these tendencies are echoed by popular LLM-as-Judge evaluation methods, which consistently identify a winning response in cases of diverging preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.

Paper Structure

This paper contains 26 sections, 1 equation, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Disagreements between pairs of annotators in MultiPref-Disagreements (left) and HelpSteer2-Disagreements (right). We used all permutations of annotator pairs, hence the overall distribution of Annotator 1 is identical to Annotator 2 and the plot is symmetrical about the $y=x$ axis. Along the $y=x$ line, annotators agree perfectly with each other.
  • Figure 2: Histograms of differences between the chosen and rejected responses predicted by our Bradley-Terry reward model trained on aggregated labels from MultiPref, evaluated on test examples with different levels of agreement. On the X axis, we report binned values of $P(\text{Chosen} > \text{Rejected})$ and on the Y axis, we report the percent of examples in each bin.
  • Figure 3: Probability density functions (PDFs) predicted by the Mean-Variance Reward Model (KL) on 3 examples, and our mapping from $r_A - r_B$ to the preference labels used during training. The area under the curve in each region is used to compute the probability of a response being labeled as significantly preferred ($A >> B$), slightly preferred ($A > B$), or tied ($A = B$).
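
The quantities in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the two rewards are independent Gaussians, so that $r_A - r_B$ is Gaussian with mean $\mu_A - \mu_B$ and variance $\sigma_A^2 + \sigma_B^2$. The region boundaries `tie` and `strong`, the function names, and the mirrored labels for $B$ are hypothetical choices made for illustration; the captions do not give the actual thresholds.

```python
import math


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """Standard Bradley-Terry probability P(Chosen > Rejected) from scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def normal_cdf(x: float, mean: float, std: float) -> float:
    """CDF of a Gaussian with the given mean and standard deviation (std > 0)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def preference_probs(mean_a, var_a, mean_b, var_b, tie=0.5, strong=1.5):
    """Integrate the density of r_A - r_B over five regions.

    ASSUMPTIONS (not from the source): rewards are independent Gaussians,
    and the boundaries `tie` and `strong` are placeholder thresholds.
    """
    mu = mean_a - mean_b
    sigma = math.sqrt(var_a + var_b)  # requires var_a + var_b > 0
    edges = [-strong, -tie, tie, strong]
    # Cumulative mass at each boundary, padded with 0 and 1 for the open tails.
    cdfs = [0.0] + [normal_cdf(e, mu, sigma) for e in edges] + [1.0]
    labels = ["B >> A", "B > A", "A = B", "A > B", "A >> B"]
    return {label: cdfs[i + 1] - cdfs[i] for i, label in enumerate(labels)}


if __name__ == "__main__":
    # Same mean gap, different variance: the wider distribution spreads
    # probability mass across more labels.
    print(preference_probs(1.0, 0.1, 0.0, 0.1))
    print(preference_probs(1.0, 2.0, 0.0, 2.0))
    print(bradley_terry_prob(1.0, 0.0))
```

Under these assumptions, two examples with the same mean reward gap but different predicted variances receive different label distributions. A single Bradley-Terry probability cannot make that distinction, because it depends only on the difference between the two scalar rewards.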