Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only

Jihan Yao, Wenxuan Ding, Shangbin Feng, Lucy Lu Wang, Yulia Tsvetkov

TL;DR

This work employs methods based on self-consistency, token probabilities, and LLM-as-a-judge to elicit wrong-over-wrong preferences, and fine-tunes language models with preference optimization approaches using these synthesized preferences.

Abstract

In the absence of abundant reliable annotations for challenging tasks and contexts, how can we expand the frontier of LLM capabilities with potentially wrong answers? We focus on two research questions: (1) Can LLMs generate reliable preferences among wrong options? And if so, (2) Would alignment with such wrong-over-wrong preferences be helpful? We employ methods based on self-consistency, token probabilities, and LLM-as-a-judge to elicit wrong-over-wrong preferences, and fine-tune language models with preference optimization approaches using these synthesized preferences. Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while overall improving model calibration.
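To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a judge has already assigned each wrong answer a numeric score (higher means "less wrong"), builds preference pairs only when the score gap reaches a hypothetical `margin`, and then measures how often the elicited preference agrees with a ground-truth notion of wrongness. The function names, the margin filter, and the `oracle_wrongness` values are illustrative assumptions.

```python
from itertools import combinations


def elicit_pairs(scores, margin):
    """Return (preferred, dispreferred) index pairs whose score gap is >= margin.

    `scores` are hypothetical judge scores for wrong answers; higher = less wrong.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        gap = scores[i] - scores[j]
        if abs(gap) >= margin:
            pairs.append((i, j) if gap > 0 else (j, i))
    return pairs


def preference_accuracy(pairs, oracle_wrongness):
    """Fraction of pairs whose preferred answer is truly less wrong.

    `oracle_wrongness` is an assumed ground-truth distance from the correct
    answer (lower = less wrong). Ties in the oracle are skipped.
    """
    decided = [(w, l) for w, l in pairs if oracle_wrongness[w] != oracle_wrongness[l]]
    if not decided:
        return float("nan")
    correct = sum(oracle_wrongness[w] < oracle_wrongness[l] for w, l in decided)
    return correct / len(decided)


# Three wrong answers to the same question (illustrative numbers only).
judge_scores = [80, 35, 60]
oracle_wrongness = [1.0, 4.0, 0.5]

pairs = elicit_pairs(judge_scores, margin=10)
accuracy = preference_accuracy(pairs, oracle_wrongness)
```

For a pairwise task the random-guess baseline is 0.5, so a result such as "20.9% higher than random guess" would be read against that reference point. A larger margin keeps fewer pairs but tends to keep the ones the judge is more certain about.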

Paper Structure

This paper contains 50 sections, 5 equations, 6 figures, 24 tables, 1 algorithm.

Figures (6)

  • Figure 1: Two phases of aligning LLMs with wrong answers: eliciting wrong-over-wrong preferences and wrong-over-wrong alignment. In Phase 1, we employ four methods to elicit wrong-over-wrong preferences, based on answer consistency, logits-based confidence, and LLM-as-a-judge approaches. In Phase 2, we align LLMs with wrong-over-wrong preferences using DPO and expect to have less wrong, more correct, and better-calibrated answers.
  • Figure 2: Correlation between task accuracy, confidence and $\mathrm{Acc}_{\textit{WoW}}$ of score-based eliciting with $M_{\textit{10}}$. Data points are from all 3 LLMs we used to elicit wrong-over-wrong preferences. $P$ stands for Pearson correlation coefficient. The ability to elicit wrong-over-wrong preferences is positively correlated with task ability but negatively correlated with confidence.
  • Figure 3: Correlation between $\mathrm{Acc}_{\textit{WoW}}$ and improvement after wrong-over-wrong alignment in less wrong $\Delta p_{\textit{wrong}}$, more correct $\Delta \mathrm{Acc}$, and better calibration $-\Delta \mathrm{ECE}$. Data points are sourced from all 4 methods ($f_{\textit{GPT-4o}}^{(\textit{p})}$ with consistency checks, $f_{\textit{GPT-4o}}^{(\textit{s})}$ with $M_{\textit{50}}$ and $M_{\textit{10}}$, and oracle $\hat{f}$), and the oracle method is treated as $\mathrm{Acc}_{\textit{WoW}} = 1.0$. The black line is the linear regression on all four datasets, while the green line is the linear regression on the Bio Generation dataset. Wrong-over-wrong alignment is not sensitive to the accuracy of the wrong-over-wrong preferences.
  • Figure 4: Evaluation of different preference optimization methods on less wrong, more correct, and better calibration. Each number is averaged over six experiment setups ((pairwise comparison with consistency check, score-based with $M_{\textit{10}}$, score-based with $M_{\textit{50}}$) $\times$ (Self-Generator, Mix-Generator)) on the NLGraph dataset.
  • Figure 5: Correlation between task accuracy, confidence, and $\mathrm{Acc}_{\textit{WoW}}$ for pairwise comparison with consistency check. Data points are from all 3 LLMs we used to elicit wrong-over-wrong preferences. $P$ stands for Pearson correlation coefficient.
  • ...and 1 more figures
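
The captions above report calibration through ECE (expected calibration error). As a hedged illustration of how such a number is commonly computed, the sketch below uses the standard equal-width binning formulation: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The bin count and the function name are assumptions; the paper's exact evaluation setup is not given in this excerpt.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    `confidences` are values in [0, 1]; `correct` are 0/1 outcomes.
    A confidence of exactly 1.0 is placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Overconfident toy model: 95% confidence but only 3 of 4 answers correct.
ece_overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
# Perfectly calibrated toy model: full confidence and always correct.
ece_perfect = expected_calibration_error([1.0, 1.0], [1, 1])
```

Under this definition a lower ECE is better, which is why the captions plot $-\Delta \mathrm{ECE}$ so that larger values mean improved calibration.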