Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover

TL;DR

This work reframes LLM alignment by introducing Joint Preference Optimization (JPO), which learns from preferences expressed jointly over instruction–response pairs rather than over responses conditioned on a fixed context. By upweighting the joint likelihood of chosen instruction–response pairs relative to rejected ones, JPO subsumes prior conditional-ranking approaches like DPO and yields robust improvements on summarization and dialogue tasks, including AlpacaEval2. Empirical results show JPO achieving higher win-rates against gold responses and outperforming baselines across multiple datasets, with benefits persisting as data scales. The study also reveals that joint preferences expose context-dependent decision heuristics, and it argues for broader evaluation paradigms beyond traditional conditional rankings, while acknowledging limitations and directions for efficient data selection and diverse annotation.

Abstract

A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the generations are evaluated within an identical context. While effective, such conditional preferences often fail to encompass the nuanced and multidimensional nature of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis based on eliciting preferences jointly over the instruction-response pairs. Unlike prior preference optimizations, which are designed for conditional ranking protocols (e.g., DPO), we propose Joint Preference Optimization (JPO), a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, LLMs trained with joint instruction-response preference data using JPO outperform LLMs trained with DPO by $5.2\%$ and $3.3\%$ win-rate for summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code are available at https://github.com/Hritikbansal/dove.

Paper Structure (41 sections, 1 theorem, 3 equations, 16 figures, 7 tables)

Key Result

Lemma D.1

In the case where $\mathcal{D}_{X} = \{(I_i, R_i, I_i, R_j)\}$, that is, when the prompts are the same for the preferred and not-preferred prompt-generation pairs, $\mathcal{L}_{\text{DPO}}(\theta; \mathcal{D}_C, \beta, p_{\text{ref}}) = \mathcal{L}_{\text{JPO}}(\theta; \mathcal{D}_X, \beta,
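The reduction stated in the lemma can be illustrated with a small numerical sketch. The exact form of the JPO objective is not reproduced in this summary, so the code below assumes a DPO-style logistic loss applied to the difference of policy-to-reference log-ratios, with each instruction-response pair scored by $\log p(R \mid I)$. The log-probability tables and the function names (`log_ratio`, `dpo_loss`, `joint_loss`) are hypothetical and chosen only for illustration.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

# Hypothetical toy values of log p(R | I), keyed by (instruction, response).
POLICY = {("I1", "Ra"): -2.0, ("I1", "Rb"): -3.5, ("I2", "Rc"): -2.5}
REFERENCE = {("I1", "Ra"): -2.4, ("I1", "Rb"): -3.0, ("I2", "Rc"): -2.6}

def log_ratio(instruction: str, response: str) -> float:
    """Policy-to-reference log-ratio for one instruction-response pair."""
    key = (instruction, response)
    return POLICY[key] - REFERENCE[key]

def dpo_loss(instruction: str, chosen: str, rejected: str, beta: float = 0.1) -> float:
    """Conditional loss: both responses share a single instruction."""
    margin = log_ratio(instruction, chosen) - log_ratio(instruction, rejected)
    return -log_sigmoid(beta * margin)

def joint_loss(chosen_pair, rejected_pair, beta: float = 0.1) -> float:
    """Assumed joint loss: each response is paired with its own instruction."""
    margin = log_ratio(*chosen_pair) - log_ratio(*rejected_pair)
    return -log_sigmoid(beta * margin)

# Identical instructions: the joint loss coincides with the conditional one.
conditional = dpo_loss("I1", "Ra", "Rb")
same_prompt = joint_loss(("I1", "Ra"), ("I1", "Rb"))

# Non-identical instructions: only the joint form is defined.
cross_prompt = joint_loss(("I2", "Rc"), ("I1", "Rb"))

print(conditional, same_prompt, cross_prompt)
```

Under these assumptions, the two losses agree whenever both pairs share an instruction, which is the situation the lemma describes; the cross-prompt call shows the additional comparisons that a joint preference dataset makes available.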

Figures (16)

  • Figure 1: Overview of the Joint Preference Optimization. (Left) We show that the conditional preference acquisition method would require the annotators to compare two responses for an identical instruction. (Right) We show that the annotators can also assign rankings jointly over instruction-response pairs. Specifically, the annotator prefers a helpful response (e.g., Apple ... Grape) over a response that ignores the context of the instruction (e.g., wear sunscreen ... litter). Our framework thus elicits preferences that are obfuscated in the prior approach.
  • Figure 2: Results for the preferences acquired jointly over the instruction-response pairs where both the responses were either chosen or rejected under the conditional rankings protocol. Here, decisive implies that the annotators could assign a preference to one instruction-response pair over the other. Here, AH means Anthropic-Helpful.
  • Figure 3: Results for the preferences acquired jointly over the instruction-response pairs where one of the instruction-response pairs was chosen (C) and the other pair was rejected (R) under the conditional rankings. Here, $C < R$ implies that the instruction-response pair that was rejected under the conditional rankings is actually preferred over an instruction-response pair that was chosen under the conditional rankings. Here, AH means Anthropic-Helpful.
  • Figure 4: Results for aligning LLMs with JPO. We utilize ChatGPT to compare the model responses with the gold responses. In 4a and 4b, we report the results averaged over three runs of the preference optimization objectives and three sampling temperatures. In 4c, we report the results for temperature set at 0.7 for AlpacaEval2.
  • Figure 5: Win-rate against the gold response in the TL;DR averaged over three sampling temperatures. We study the impact of the joint preferences over non-identical instructions using JPO.
  • ...and 11 more figures

Theorems & Definitions (2)

  • Lemma D.1
  • proof