TODO: Enhancing LLM Alignment with Ternary Preferences
Yuxiang Guo, Lu Yin, Bo Jiang, Jiaqi Zhang
TL;DR
This work identifies a core limitation of alignment methods built on the binary Bradley-Terry model (such as DPO): they cannot represent ties and handle noisy preference labels poorly. It introduces the Tie-rank Oriented BT (TOBT) model, which captures ternary relations (prefer, tie, disprefer), and Tie-rank Oriented Direct Preference Optimization (TODO), which optimizes over both clear preferences and ties. The TODO objective has a preference term L^p_TODO and a tie term L^t_TODO, with a margin α that modulates the updates for non-tie pairs. The approach is validated on Mistral-7B and Llama 3-8B with in-distribution and out-of-distribution data, showing consistent improvements over DPO on MT Bench, Piqa, ARC, Hellaswag, MMLU, Winogrande, and Reward Bench, and it remains competitive when trained on binary data only. Together, TOBT and TODO support richer, more robust LLM alignment, with potential for integration into RLHF reward models and offline or on-policy preference optimization pipelines. The authors provide an open-source implementation at https://github.com/XXares/TODO.
Abstract
Aligning large language models (LLMs) with human intent is critical for enhancing their performance across a variety of tasks. Standard alignment techniques, such as Direct Preference Optimization (DPO), often rely on the binary Bradley-Terry (BT) model, which can struggle to capture the complexities of human preferences, particularly in the presence of noisy or inconsistent labels and frequent ties. To address these limitations, we introduce the Tie-rank Oriented Bradley-Terry model (TOBT), an extension of the BT model that explicitly incorporates ties, enabling more nuanced preference representation. Building on this, we propose Tie-rank Oriented Direct Preference Optimization (TODO), a novel alignment algorithm that leverages TOBT's ternary ranking system to improve preference alignment. In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments using MT Bench and benchmarks such as Piqa, ARC-c, and MMLU further demonstrate TODO's superior alignment performance. Notably, TODO also shows strong results in binary preference alignment, highlighting its versatility and potential for broader integration into LLM alignment. The implementation is available at https://github.com/XXares/TODO.
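To make the ternary idea concrete, here is a minimal sketch of how a tie-aware Bradley-Terry model and the two loss terms could look. The summary above does not spell out the formulas, so the functional form below is an assumption: it follows the classical Rao-Kupper tie extension of BT, with the margin α acting as the tie threshold. The function names (`ternary_probs`, `todo_style_loss`) and the scalar `mu` (standing in for the scaled policy-versus-reference log-ratio difference used in DPO) are hypothetical and are not taken from the authors' repository.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ternary_probs(r1: float, r2: float, alpha: float) -> tuple[float, float, float]:
    """Return (P(y1 preferred), P(tie), P(y2 preferred)).

    Assumed Rao-Kupper-style form: a response must beat the other by
    more than the margin alpha before it counts as clearly preferred.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive for ties to have nonzero probability")
    d = r1 - r2
    p_win = sigmoid(d - alpha)
    p_lose = sigmoid(-d - alpha)
    p_tie = (math.exp(2.0 * alpha) - 1.0) * p_win * p_lose
    return p_win, p_tie, p_lose


def todo_style_loss(mu: float, alpha: float, is_tie: bool) -> float:
    """Negative log-likelihood of the observed label under the assumed model.

    mu is the (hypothetical) implicit reward difference between the two
    responses. The non-tie branch plays the role of L^p_TODO and the
    tie branch the role of L^t_TODO.
    """
    p_win, p_tie, _ = ternary_probs(mu, 0.0, alpha)
    return -math.log(p_tie if is_tie else p_win)
```

Under this assumed form, the three probabilities sum to one, the tie probability peaks when the two rewards are equal, and the preference loss only becomes small once the reward difference exceeds α. With binary data the tie branch is simply never used, which matches the claim that the method still applies to binary preference sets.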
