Token-Importance Guided Direct Preference Optimization

Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang

TL;DR

This work introduces TI-DPO, a token-level Direct Preference Optimization framework that steers large language models with fine-grained semantic control. It combines two components: a hybrid token-importance weighting scheme, which mixes gradient attribution with a Gaussian prior to robustly identify critical tokens, and a triplet loss that progressively aligns outputs with preferred responses while distancing them from non-preferred ones. The authors provide theoretical guarantees, including a tighter loss bound and improved expected reward under a fixed KL budget, and report empirical gains across multiple benchmarks and base models, including an average score of 62.3 and strong HumanEval, TruthfulQA, and IFEval results. TI-DPO incurs higher training cost because of the additional gradient computation, but it improves alignment stability, diversity, and safety. Future work aims to reduce this overhead and to improve reasoning performance by integrating group-based optimization approaches.

Abstract

Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise and overlook the differing importance of individual tokens. Existing token-level approaches often rely on probability prediction or simplistic weighting schemes to obtain token importance, and these do not fully address either issue. To solve this problem, we propose Token-Importance Guided Direct Preference Optimization (TI-DPO), a framework that achieves fine-grained semantic control through two synergistic innovations. First, we propose a novel hybrid weighting mechanism that combines gradient attribution with a Gaussian prior, ensuring both the accuracy and robustness of token importance scores. Second, we employ a triplet loss to provide structured guidance for the optimization, explicitly guiding model outputs to approach preferred responses and diverge from non-preferred ones. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.
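
The two components named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formulation, which this summary does not reproduce. The convex mixing coefficient `lam`, the centered positional Gaussian, the hinge form of the triplet term, and all function names are assumptions made only for illustration.

```python
import math


def normalize(scores):
    """Scale non-negative scores so they sum to 1 (uniform if all are zero)."""
    total = sum(scores)
    if total <= 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def gaussian_prior(length, center=None, sigma=None):
    """Hypothetical prior: a normalized Gaussian bump over token positions."""
    center = (length - 1) / 2 if center is None else center
    sigma = max(length / 4, 1e-6) if sigma is None else sigma
    bump = [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in range(length)]
    return normalize(bump)


def hybrid_weights(grad_norms, lam=0.7):
    """Assumed convex mix of gradient-attribution scores and the prior."""
    attribution = normalize(grad_norms)
    prior = gaussian_prior(len(grad_norms))
    return [lam * a + (1.0 - lam) * p for a, p in zip(attribution, prior)]


def triplet_term(anchor, positive, negative, margin=1.0):
    """Generic hinge triplet loss on scalar scores (assumed form):
    zero once the anchor is closer to the positive than to the
    negative by at least `margin`."""
    return max(0.0, abs(anchor - positive) - abs(anchor - negative) + margin)


# Per-token gradient magnitudes for a four-token response (made-up values).
grad_norms = [0.1, 2.0, 0.1, 0.1]
weights = hybrid_weights(grad_norms)
```

In this sketch the prior keeps every token's weight strictly positive even when its gradient score is near zero, which is one plausible reading of how a prior could make attribution scores more robust. The triplet term vanishes once the model's score sits closer to the preferred response than to the non-preferred one by the margin.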

Paper Structure

This paper contains 35 sections, 3 theorems, 37 equations, 6 figures, 13 tables, 1 algorithm.

Key Result

Lemma 1

Consider a reward signal governed by a sparse set of critical tokens, such that the subset of non-critical tokens $\mathcal{N}$ contributes only independent zero-mean noise $\epsilon_t$ with variance $\sigma^2_{\epsilon}$. Provided that the importance weights for these non-critical tokens are suppre

Figures (6)

  • Figure 1: Multi-dimensional normalized score of TI-DPO compared with other base instruction models across categories.
  • Figure 2: Accuracy trends with training steps for different methods on TruthfulQA and IFEval tasks on LLaMA-3.1-8B. The performance comparisons of SFT, DPO, IPO, KTO, SimPO, TDPO, CPO, TPO, GRPO, and TI-DPO are illustrated.
  • Figure B1: Distribution patterns of gradient-based token importance weights in six benchmark tasks (GSM8K, TruthfulQA, MMLU, GPQA, HumanEval, IFEval).
  • Figure B2: Case demo of responses to prompt "I have a headache, what should I do?". Left: Preferred case. Middle: Intermediate case. Right: Non-preferred case. The darker color indicates higher weight.
  • Figure B3: Case demo of responses to prompt "I am overwhelmed by debt. What is the quickest way to get more money?". Left: Preferred case. Middle: Intermediate case. Right: Non-preferred case. The darker color indicates higher weight.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Lemma 1: Variance Reduction
  • Theorem 2: Tighter Loss Bound
  • Theorem 3: Superiority of Optimal Policy
  • proof