TODO: Enhancing LLM Alignment with Ternary Preferences

Yuxiang Guo, Lu Yin, Bo Jiang, Jiaqi Zhang

TL;DR

This work identifies a core limitation of binary Bradley-Terry-based alignment methods such as DPO: they cannot represent ties and handle noisy preference labels poorly. It introduces the Tie-rank Oriented Bradley-Terry (TOBT) model for ternary relations (preferred, tied, dispreferred) and Tie-rank Oriented Direct Preference Optimization (TODO), which optimizes over both clear preferences and ties using a preference term $L^p_{\text{TODO}}$, a tie term $L^t_{\text{TODO}}$, and a margin $\alpha$ that modulates non-tie updates. The approach is validated on Mistral-7B and Llama 3-8B across in-distribution and out-of-distribution data, showing consistent improvements over DPO on MT Bench, PIQA, ARC, HellaSwag, MMLU, WinoGrande, and Reward Bench, and performing well even when restricted to binary preference data. Together, TOBT and TODO support richer, more robust LLM alignment, with potential for integration into RLHF reward models and offline or on-policy preference optimization pipelines. The authors provide an open-source implementation at the referenced GitHub repository.

Abstract

Aligning large language models (LLMs) with human intent is critical for enhancing their performance across a variety of tasks. Standard alignment techniques, such as Direct Preference Optimization (DPO), often rely on the binary Bradley-Terry (BT) model, which can struggle to capture the complexities of human preferences, particularly in the presence of noisy or inconsistent labels and frequent ties. To address these limitations, we introduce the Tie-rank Oriented Bradley-Terry model (TOBT), an extension of the BT model that explicitly incorporates ties, enabling more nuanced preference representation. Building on this, we propose Tie-rank Oriented Direct Preference Optimization (TODO), a novel alignment algorithm that leverages TOBT's ternary ranking system to improve preference alignment. In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments using MT Bench and benchmarks such as PIQA, ARC-c, and MMLU further demonstrate TODO's superior alignment performance. Notably, TODO also shows strong results in binary preference alignment, highlighting its versatility and potential for broader integration into LLM alignment. The implementation details can be found at https://github.com/XXares/TODO.
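
To make the idea of a ternary preference model with a margin $\alpha$ concrete, here is a minimal sketch in plain Python. It assumes a Rao-Kupper-style tie extension of Bradley-Terry, a classical way to add ties to BT; the exact TOBT and TODO formulas are not given in this summary and may differ. The function names (`implicit_margin`, `ternary_probs`, `preference_loss`, `tie_loss`) and the default `beta` are illustrative, not taken from the authors' repository.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_margin(logp_1, logp_2, ref_logp_1, ref_logp_2, beta=0.1):
    """DPO-style implicit reward difference between responses y1 and y2.

    Inputs are summed token log-probabilities under the policy and the
    frozen reference model; beta is an assumed temperature.
    """
    return beta * ((logp_1 - ref_logp_1) - (logp_2 - ref_logp_2))


def ternary_probs(mu: float, alpha: float):
    """Assumed Rao-Kupper-style probabilities (y1 wins, tie, y2 wins).

    mu is the reward margin r(y1) - r(y2); alpha > 0 is the tie margin.
    As alpha approaches 0 the tie probability vanishes and the binary
    Bradley-Terry model is recovered.
    """
    p_win = math.exp(log_sigmoid(mu - alpha))
    p_lose = math.exp(log_sigmoid(-mu - alpha))
    p_tie = math.expm1(2.0 * alpha) * p_win * p_lose
    return p_win, p_tie, p_lose


def preference_loss(mu: float, alpha: float) -> float:
    """Negative log-likelihood of 'y1 preferred' (plays the role of L^p)."""
    return -log_sigmoid(mu - alpha)


def tie_loss(mu: float, alpha: float) -> float:
    """Negative log-likelihood of 'y1 and y2 tied' (plays the role of L^t)."""
    return -(
        math.log(math.expm1(2.0 * alpha))
        + log_sigmoid(mu - alpha)
        + log_sigmoid(-mu - alpha)
    )


if __name__ == "__main__":
    # At initialization the policy equals the reference, so mu = 0.
    for alpha in (0.1, 0.5, 1.0):
        print(alpha, ternary_probs(0.0, alpha),
              preference_loss(0.0, alpha), tie_loss(0.0, alpha))
```

Under this assumed form, the preference loss keeps pushing the margin past $\alpha$ rather than past zero, while the tie loss is minimized when the two implicit rewards are equal, so tied pairs pull the margin toward zero instead of being discarded or mislabeled.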

Paper Structure

This paper contains 35 sections, 38 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Comparison of DPO and TODO. DPO relies on the BT model, which is only capable of handling binary preferences. When responses are tied, it either learns incorrect preference information or discards tied data, preventing learning from such data. In contrast, the proposed TOBT model can directly model ternary preferences. Based on this, TODO can learn more information from tied data and exhibits better robustness against potential noise in binary preference data.
  • Figure 2: Accuracy of Mistral and Llama 3 models aligned with DPO and TODO on the non-tie preference test set and Reward Bench. The X-axis denotes the proportion of tie data mixed into the training set.
  • Figure 3: MT Bench results of Mistral-7B and Llama 3-8B. The models are aligned with DPO and TODO using datasets with varying ratios of tie data.
  • Figure 4: Human evaluation of pairwise responses across various prompts in the Chatbot Arena.
  • Figure 5: The initial preference and tie losses simulated with different $\alpha$ values.
  • ...and 1 more figure