Compare without Despair: Reliable Preference Evaluation with Generation Separability

Sayan Ghosh, Tejas Srinivasan, Swabha Swayamdipta

TL;DR

This work introduces a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation, and incorporates separability into ELO ratings, accounting for how suitable each test instance might be for reliably ranking LLMs.

Abstract

Human evaluation of generated language through pairwise preference judgments is pervasive. However, under common scenarios, such as when generations from a model pair are very similar, or when stochastic decoding results in large variations in generations, such evaluation yields inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability offers insight into which test benchmarks are more valuable for comparing models. Finally, we incorporate separability into ELO ratings, accounting for how suitable each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient and robust preference evaluation of LLMs with both human- and auto-raters.
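
The abstract's idea of folding separability into ELO ratings can be illustrated with a small sketch. This excerpt does not give the paper's exact formulation, so the code below is only one plausible reading: it assumes that a per-instance separability value in [0, 1] scales the step size of a standard Elo update. The function names, the clipping to [0, 1], and the default `k=32.0` are all hypothetical choices made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of model A against model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def weighted_elo_update(rating_a: float, rating_b: float, outcome_a: float,
                        separability: float, k: float = 32.0):
    """One Elo update whose step size is scaled by separability.

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    ASSUMPTION: the weight is separability clipped to [0, 1]; the paper's
    actual weighting scheme is not shown in this excerpt.
    """
    weight = min(max(separability, 0.0), 1.0)
    delta = k * weight * (outcome_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # The same win moves ratings less when it comes from a low-separability instance.
    print(weighted_elo_update(1000.0, 1000.0, 1.0, separability=1.0))
    print(weighted_elo_update(1000.0, 1000.0, 1.0, separability=0.5))
    print(weighted_elo_update(1000.0, 1000.0, 1.0, separability=0.0))
```

Under this assumption, a preference judgment on an instance with separability near zero leaves the ratings almost unchanged, which matches the stated goal of accounting for how suitable each test instance is for ranking.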

Paper Structure (22 sections, 7 equations, 16 figures, 7 tables)


Figures (16)

  • Figure 1: Illustration of separability on SAMSum dialog summarization from our human experiments (§\ref{sec:human-study}). Test instances have varying degrees of separability, which lead to different levels of consistency in preference ratings. For lower separability instances, the choice of which pair of sampled generations to show raters affects human rating (raters preferred Model A under Pair 1 and B under Pair 2); hence the overall judgment between model pairs is inconsistent. Human preferences are consistent under higher separability (raters always preferred Model B).
  • Figure 2: Four scenarios illustrating the intuition behind separability. Blue and gold circles represent generations from models $m_A$ and $m_B$ respectively, and Euclidean distances represent (dis)similarities between them. For a given input, at least one of the two models needs to have higher similarity among its own generations (high self-alignment) to have high separability for that input. High similarity across generations from different models (high cross-alignment) leads to lower separability. High self-alignment corresponds to low spread of a set of same-colored circles and vice-versa. High cross-alignment corresponds to low spread of the entire set of circles and vice-versa.
  • Figure 3: Histograms of separability distributions for summarization (Left) and translation (Right). For similar model pairs, CNN/DailyMail for news summarization and translation from a high-resource language (German) have lower average separability compared to SAMSum for dialogue summarization and translation from a lower-resource language (Czech). We use length-adjusted BERTScore (Zhang et al.; defined in Section \ref{sec:sim-fcns}) as the similarity metric for summarization and BLEU (Papineni et al., 2002) for translation.
  • Figure 4: Separability distributions for ART and BiSECT. We use length-adjusted BERTScore here (defined in Section \ref{sec:sim-fcns}) as the similarity metric. Separability has higher variance, especially for BiSECT, largely caused by differences in instruction prompt interpretation; see Appendix \ref{appendix:examples}.
  • Figure 5: Separability is robust to changes in the temperature $\tau$ used for generation (Left), the number of samples used to estimate alignments $K$ (Middle), and the number of cross-alignment comparisons $C$ (Right), for GPT-3.5 vs. Vicuna-7B on SAMSum.
  • ...and 11 more figures
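
The intuition in the Figure 2 and Figure 5 captions (self-alignment within each model's samples, cross-alignment across models, estimated from $K$ samples and $C$ cross comparisons) can be sketched in code. The equations themselves are not part of this excerpt, so the combination rule used here, the larger of the two self-alignments minus the cross-alignment, is an assumption chosen to agree with the captions. The token-overlap similarity is a toy stand-in for the length-adjusted BERTScore or BLEU named in the captions, and all function names are hypothetical.

```python
import itertools
import random
from statistics import mean


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercased tokens (NOT BERTScore or BLEU)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_alignment(gens, sim=token_overlap) -> float:
    """Mean similarity over all unordered pairs of one model's K generations."""
    if len(gens) < 2:
        raise ValueError("need at least two generations per model")
    return mean(sim(a, b) for a, b in itertools.combinations(gens, 2))


def cross_alignment(gens_a, gens_b, num_comparisons=None,
                    sim=token_overlap, seed=0) -> float:
    """Mean similarity over C sampled (A, B) pairs; uses all pairs if C is None."""
    pairs = list(itertools.product(gens_a, gens_b))
    if num_comparisons is not None and num_comparisons < len(pairs):
        pairs = random.Random(seed).sample(pairs, num_comparisons)
    return mean(sim(a, b) for a, b in pairs)


def separability(gens_a, gens_b, num_comparisons=None, sim=token_overlap) -> float:
    """ASSUMED rule: max self-alignment minus cross-alignment.

    High when at least one model is self-consistent and the two models'
    generations differ from each other; near zero when everything looks alike.
    """
    best_self = max(self_alignment(gens_a, sim), self_alignment(gens_b, sim))
    return best_self - cross_alignment(gens_a, gens_b, num_comparisons, sim)


if __name__ == "__main__":
    model_a = ["amanda baked cookies", "amanda baked cookies", "amanda baked some cookies"]
    model_b = ["jerry will get cookies tomorrow", "jerry gets cookies tomorrow",
               "jerry will get cookies tomorrow"]
    print(separability(model_a, model_b))   # distinct sets of generations
    print(separability(model_a, model_a))   # same set compared with itself
```

Swapping `token_overlap` for a real similarity metric only requires passing a different `sim` callable with the same two-string signature.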