Token-Importance Guided Direct Preference Optimization
Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang
TL;DR
This work introduces TI-DPO, a token-level Direct Preference Optimization framework that steers large language models with fine-grained semantic control. It combines two components: a hybrid token-importance weighting scheme, which mixes gradient attribution with a Gaussian prior to robustly identify critical tokens, and a triplet loss structure that enforces progressive alignment toward preferred outputs while distancing the model from non-preferred ones. The authors provide theoretical guarantees, including a tighter loss bound and improved expected reward under a fixed KL budget, and demonstrate empirical gains across multiple benchmarks and base models, notably achieving an average score of 62.3 and strong HumanEval, TruthfulQA, and IFEval results. While TI-DPO incurs a higher training cost due to the additional gradient computation, it yields improved alignment stability, diversity, and safety; future work aims to reduce this overhead and enhance reasoning performance through integration with group-based optimization approaches.
Abstract
Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise and overlook the differential importance of individual tokens. Existing token-level approaches often rely on probability prediction or simplistic weighting schemes to obtain token importance, which still do not fully address these issues. To address them, we propose Token-Importance Guided Direct Preference Optimization (TI-DPO), a framework that achieves fine-grained semantic control through two synergistic innovations. First, we propose a novel hybrid weighting mechanism that combines gradient attribution with a Gaussian prior, ensuring both the accuracy and robustness of token importance scores. Second, we employ a triplet loss to provide structured guidance for the optimization, explicitly guiding model outputs to approach preferred responses and diverge from non-preferred ones. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions than DPO and other RLHF methods.
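
To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes details the abstract does not state: that attribution scores are non-negative per-token values (for example, gradient norms), that the Gaussian prior is placed over token positions and centered mid-sequence, that the two are mixed by a convex weight `lam`, and that the triplet term is a standard Euclidean hinge. All function names and default values are hypothetical.

```python
import math


def hybrid_token_weights(attributions, lam=0.5, sigma=None):
    """Mix normalized attribution scores with a positional Gaussian prior.

    Assumptions (not from the paper): scores are non-negative, the prior is
    centered at the middle of the sequence, and sigma defaults to n / 4.
    """
    n = len(attributions)
    total = sum(attributions)
    # Normalize attribution scores; fall back to uniform if all are zero.
    attr = [a / total for a in attributions] if total > 0 else [1.0 / n] * n
    mu = (n - 1) / 2.0
    sigma = n / 4.0 if sigma is None else sigma
    prior = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in range(n)]
    z = sum(prior)
    # Convex combination, so the weights are non-negative and sum to 1.
    return [lam * a + (1.0 - lam) * p / z for a, p in zip(attr, prior)]


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Token-weighted sum of log(pi_theta / pi_ref) over one response."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def token_weighted_dpo_loss(r_chosen, r_rejected, beta=0.1):
    """DPO-style loss -log(sigmoid(beta * (r_chosen - r_rejected)))."""
    m = beta * (r_chosen - r_rejected)
    # Numerically stable softplus(-m).
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pulling the anchor toward positive and away from negative."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In this sketch, the weights from `hybrid_token_weights` would be computed once per response and passed to `weighted_log_ratio` for both the chosen and the rejected response before the two results are compared in `token_weighted_dpo_loss`. How the triplet term is combined with the preference loss, and what plays the role of the anchor, is left open here because the abstract does not specify it.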
