Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

TL;DR

Triple Preference Optimization (TPO) introduces a one-step, multi-preference objective to align LLMs for both instruction-following and reasoning. By reparameterizing rewards as $r(x,y)=\beta \log \pi_\theta(y|x)$ and optimizing a preference term combined with gold-response regularization, TPO achieves gains over DPO and its variants while using less data. The length-controlled variant, TPO-L, uses a reward margin to regulate verbosity, enabling robust performance across benchmarks and data regimes. The method is grounded theoretically in MERL and the Bradley-Terry model, and extensive experiments in base and instruction settings with Llama-3 and Mistral show improved reward modeling, reduced optimization conflicts, and stronger robustness to judgment noise. Overall, TPO and TPO-L offer data-efficient alternatives for offline preference optimization with broad applicability to alignment tasks.
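To make the TL;DR concrete, here is a minimal per-example sketch of a loss with this shape. It is an illustration under stated assumptions, not the authors' implementation: the function name `tpo_style_loss`, the weight `alpha` on the gold term, the default values of `beta` and `alpha`, and the choice of a plain negative log-likelihood for the gold-response regularizer are all assumptions. Only the implicit reward $r(x,y)=\beta \log \pi_\theta(y|x)$ and the "preference term plus gold-response regularization" structure come from the summary above. Inputs are assumed to be sequence-level log-probabilities under the policy.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def tpo_style_loss(
    logp_gold: float,
    logp_chosen: float,
    logp_rejected: float,
    beta: float = 0.1,   # assumed value, for illustration only
    alpha: float = 1.0,  # assumed weight on the gold term (hypothetical name)
) -> float:
    """Sketch of a one-step loss: preference term + gold-response regularization."""
    # Implicit rewards r(x, y) = beta * log pi_theta(y | x); no reference model.
    reward_chosen = beta * logp_chosen
    reward_rejected = beta * logp_rejected

    # Bradley-Terry style preference term on the chosen/rejected pair.
    preference_term = -log_sigmoid(reward_chosen - reward_rejected)

    # Assumed regularizer: negative log-likelihood of the gold response.
    gold_term = -logp_gold

    return preference_term + alpha * gold_term


# Example: the loss drops as the chosen response becomes more likely than the rejected one.
loss_tied = tpo_style_loss(logp_gold=-2.0, logp_chosen=-5.0, logp_rejected=-5.0)
loss_separated = tpo_style_loss(logp_gold=-2.0, logp_chosen=-4.0, logp_rejected=-5.0)
```

Because both terms depend only on the current policy, a single optimization pass can update it on all three responses at once, which is the sense in which the objective is "one-step" in this sketch.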

Abstract

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and to the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama 3. Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes. Specifically, TPO outperforms DPO and SimPO by up to 7.0 and 7.3 percentage points on Arena-Hard, 12.2 and 13.3 points on MixEval-Hard, 10.4 and 10.1 points on MMLU-Pro, and 19.0 and 19.2 points on GSM8K, respectively. Furthermore, TPO achieves these improvements while requiring less data than DPO.

Paper Structure

This paper contains 61 sections, 3 theorems, 42 equations, 12 figures, and 12 tables.

Key Result

Lemma 1

Under the Plackett-Luce framework, and in particular the Bradley-Terry preference framework, two reward functions from the same class induce the same preference distribution [rafailov2024direct].
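A small numeric check can illustrate the lemma in the Bradley-Terry case. The sketch below assumes, following the cited DPO work, that "the same class" means two rewards that differ only by a prompt-dependent term $f(x)$; the summary above does not spell this out. The reward values and the shift are made-up numbers chosen for illustration.

```python
import math


def bt_preference(reward_w: float, reward_l: float) -> float:
    """P(y_w preferred over y_l) under Bradley-Terry: sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))


# Hypothetical rewards for two responses to the same prompt x.
r_w, r_l = 1.3, -0.4

# Hypothetical prompt-only shift f(x), added to both rewards.
shift = 7.5

p_original = bt_preference(r_w, r_l)
p_shifted = bt_preference(r_w + shift, r_l + shift)

# The shift cancels in the difference r_w - r_l, so both probabilities agree
# up to floating-point rounding.
same_preference = math.isclose(p_original, p_shifted, rel_tol=1e-12)
```

Only reward differences enter the Bradley-Terry probability, so any term shared by both responses to a prompt has no effect on the induced preference distribution.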

Figures (12)

  • Figure 1: TPO and TPO-L differ from DPO by removing the reference model and adding a behavioral cloning objective with a regularization term for gold responses, which are distinct from the preferred and rejected responses. TPO and TPO-L outperform DPO on instruction-following and reasoning benchmarks simultaneously.
  • Figure 2: Comparison of improvements achieved during the post-training stage, as measured by the DAA metric, by evaluating the performance of the SFT checkpoint against the preference optimization checkpoint on downstream tasks (more details in the appendix on downstream tasks).
  • Figure 3: Overview of the data and optimization processing. Left Top: Visualization of the data structure in the UltraFeedback dataset. Right Top: Selection of gold, preferred (chosen), and rejected responses based on overall scores generated by GPT-4. Bottom: Optimization differences between TPO and DPO.
  • Figure 4: Comparison of Arena-Hard scores based on the average token length of generated responses for 500 prompts in the Arena-Hard benchmark across various settings.
  • Figure 5: Reward modeling exploration on UltraFeedback test set. Top: Comparison of reward distributions for SFT, SimPO, and TPO methods across varying data sizes. Bottom: Analysis of the impact of $\log \pi_\theta (y|x)$ as an implicit reward for SFT, SimPO, and TPO across different data sizes.
  • ...and 7 more figures

Theorems & Definitions (3)

  • Lemma 1
  • Lemma 2
  • Theorem 1