Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization
Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
TL;DR
Triple Preference Optimization (TPO) introduces a one-step, multi-preference objective to align LLMs for both instruction-following and reasoning. By reparameterizing rewards as $r(x,y)=\beta \log \pi_\theta(y|x)$ and optimizing a combination of a preference term with gold-response regularization, TPO achieves gains over DPO and variants while using less data. The length-controlled variant, TPO-L, uses a reward margin to regulate verbosity, enabling robust performance across benchmarks and data regimes. Theoretical grounding in MERL and Bradley-Terry models, plus extensive experiments on base and instruction settings with Llama-3 and Mistral, demonstrate improved reward modeling, reduced optimization conflicts, and stronger robustness to judgment noise. Overall, TPO and TPO-L offer efficient, data-efficient alternatives for offline preference optimization with broad applicability to alignment tasks.
Abstract
Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama 3. Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes. Specifically, TPO outperforms DPO and SimPO by up to 7.0% and 7.3% points on Arena-Hard, 12.2% and 13.3% points on MixEval-Hard, 10.4% and 10.1% points on MMLU-Pro, and 19.0% and 19.2% points on GSM8K, respectively. Furthermore, TPO achieves these improvements while requiring less data than DPO.
