Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover
TL;DR
This work reframes LLM alignment by introducing Joint Preference Optimization (JPO), which learns from preferences jointly over instruction–response pairs rather than conditioning on a fixed context. By upweighting the joint likelihood of chosen instruction–response pairs relative to rejected ones, JPO subsumes prior conditional-ranking approaches like DPO and yields robust improvements on summarization and dialogue tasks, including AlpacaEval2. Empirical results show JPO achieving higher win-rates against gold responses and outperforming baselines across multiple datasets, with benefits persisting as data scales. The study also reveals that joint preferences expose context-dependent decision heuristics and argues for broader evaluation paradigms beyond traditional conditional rankings, while acknowledging limitations and directions for efficient data selection and diverse annotation.
Abstract
A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the generations are evaluated within an identical context. While effective, such conditional preferences often fail to encompass the nuanced and multidimensional nature of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis based on eliciting preferences jointly over the instruction-response pairs. Unlike prior preference optimizations, which are designed for conditional ranking protocols (e.g., DPO), we propose Joint Preference Optimization (JPO), a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, LLMs trained with joint instruction-response preference data using JPO outperform LLMs trained with DPO by $5.2\%$ and $3.3\%$ win-rate for summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code are available at https://github.com/Hritikbansal/dove.
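To make the idea of upweighting the joint probability of the chosen pair concrete, here is a minimal sketch in plain Python. The abstract does not state the exact objective, so this sketch assumes a DPO-style logistic loss applied to reference-normalized joint log-probabilities; the function names (`joint_logprob`, `joint_preference_loss`), the `beta` default, and all numeric values are hypothetical and are not taken from the released code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def joint_logprob(instruction_lp: float, response_lp: float) -> float:
    """Chain rule: log p(I, R) = log p(I) + log p(R | I)."""
    return instruction_lp + response_lp


def joint_preference_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed temperature, as in DPO-style losses
) -> float:
    """Assumed DPO-style loss on joint log-probs of (instruction, response) pairs.

    Each argument is a joint log-probability log p(I, R) under the policy
    or the frozen reference model. The loss shrinks as the policy raises the
    chosen pair's joint probability relative to the rejected pair's.
    """
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


# Two different instructions, each with its own response (made-up log-probs).
loss_joint = joint_preference_loss(
    policy_chosen=joint_logprob(-3.0, -5.0),
    policy_rejected=joint_logprob(-2.5, -7.0),
    ref_chosen=joint_logprob(-3.2, -6.0),
    ref_rejected=joint_logprob(-2.4, -6.5),
)

# Same instruction on both sides: the log p(I) terms cancel in the margin,
# so the loss matches one computed from conditional log-probs alone.
loss_shared = joint_preference_loss(
    policy_chosen=joint_logprob(-3.0, -5.0),
    policy_rejected=joint_logprob(-3.0, -7.0),
    ref_chosen=joint_logprob(-4.0, -6.0),
    ref_rejected=joint_logprob(-4.0, -6.5),
)
loss_conditional = joint_preference_loss(-5.0, -7.0, -6.0, -6.5)
```

Under these assumptions, the shared-instruction case illustrates why a joint objective can contain conditional ranking as a special case: when both pairs use the same instruction, the instruction term drops out of the margin and only the response log-probabilities matter.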
