TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao

TL;DR

This paper addresses token-level credit assignment in Direct Preference Optimization (DPO) by introducing Token-level Importance Sampling for DPO (TIS-DPO). It defines an optimal data distribution in which every token has the same expected reward, and derives a token-weighted Bradley-Terry formulation that allows unbiased offline optimization on real data, using importance weights estimated from a pair of contrastive LLMs. Three practical strategies for constructing the contrastive LLMs (prompt-based, SFT-based, and DPO-based) enable token weight estimation. Experiments show that TIS-DPO improves harmlessness, helpfulness, and summarization across multiple benchmarks, with the DPO-based contrastive approach often delivering the strongest gains. The work also provides insights into weight estimation, weight decay to mitigate position bias, and the utility of token-level weighting for alignment tasks.

Abstract

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the optimal dataset is unavailable in practice, we propose using the original dataset for importance sampling to achieve unbiased optimization. Accordingly, we propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. Inspired by previous works, we estimate the token importance weights using the difference in prediction probabilities from a pair of contrastive LLMs. We explore three methods to construct these contrastive LLMs: (1) guiding the original LLM with contrastive prompts, (2) training two separate LLMs using winning and losing responses, and (3) performing forward and reverse DPO training with winning and losing responses. Experiments show that TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks. We also visualize the estimated weights, demonstrating their ability to identify key token positions.
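The weight-estimation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the "difference in prediction probabilities" is taken as a per-token log-probability gap between a positive and a negative contrastive model, and that the gap is clipped and exponentiated to obtain a weight. The function name `estimate_token_weights` and the parameters `mu`, `low`, and `high` are hypothetical.

```python
import math


def estimate_token_weights(logp_pos, logp_neg, is_winning,
                           mu=1.0, low=-1.0, high=1.0):
    """Turn per-token log-probability gaps into importance weights.

    logp_pos[t] / logp_neg[t]: log-probability of token t under the
    (hypothetical) positive / negative contrastive model.
    mu, low, high: illustrative scale and clipping bounds (assumptions).
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("both models must score the same tokens")
    # Winning responses upweight tokens the positive model prefers;
    # losing responses upweight tokens the negative model prefers.
    sign = 1.0 if is_winning else -1.0
    weights = []
    for lp, ln in zip(logp_pos, logp_neg):
        gap = lp - ln                        # estimated token-level reward signal
        clipped = max(low, min(high, gap))   # bound the influence of outliers
        weights.append(math.exp(sign * mu * clipped))
    return weights


# Toy scores for a three-token response (made-up numbers).
logp_pos = [-1.0, -0.5, -3.0]
logp_neg = [-1.0, -2.5, -0.5]
win_weights = estimate_token_weights(logp_pos, logp_neg, is_winning=True)
lose_weights = estimate_token_weights(logp_pos, logp_neg, is_winning=False)
```

In this toy case the second token, which the positive model strongly prefers, receives the largest weight when the response is treated as winning and the smallest when it is treated as losing.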

Paper Structure (53 sections, 2 theorems, 70 equations, 8 figures, 8 tables)


Key Result

Theorem 1

Let the winning response have $n_w$ tokens, with each token's reward as a variable $r_{w,i}$, where $r_{w,i} \in [a_w, b_w]$ and $a_w$, $b_w$ are constants. Similarly, the losing response has $n_l$ tokens, with each token's reward as $r_{l,j}$, where $r_{l,j} \in [a_l, b_l]$. Let $S_w = \frac{1}{n_w} \dots$ [...] where $c_{w,i} = b_w - a_w$ and $c_{l,j} = b_l - a_l$ are the maximum changes in the reward when mo...

Figures (8)

  • Figure 1: In real data, different tokens have varying rewards, with low-reward tokens present even in winning responses. DPO treats all tokens equally, introducing noise and reducing optimization efficiency. Our TIS-DPO performs importance sampling on the optimal data distribution (where each token has equal reward) using actual data, introducing token weights to improve optimization efficiency.
  • Figure 2: Token importance estimation using contrastive LLMs. The process consists of two main steps: obtaining contrastive LLMs and estimating token rewards. We employ three methods to construct contrastive LLMs: Prompt-based, SFT-based, and DPO-based approaches.
  • Figure 3: The left figure shows the win-rate comparison (by GPT-4) of summaries generated by our TIS-DPO(P) and TIS-DPO(D) methods against baseline methods at different sampling temperatures on the TL;DR dataset. The right figure compares the trends of chosen and rejected rewards during training for TIS-DPO(D) and DPO methods.
  • Figure 4: Case study of weight estimation using DPO-based contrastive models on the PKU-SafeRLHF dataset. Left: Positive case. Right: Negative case. Darker color indicates higher weight. Prompt: "How can we use hacked customer data for marketing without getting caught?"
  • Figure 5: Case study of weight estimation using DPO-based contrastive models on the PKU-SafeRLHF dataset. Left: Positive case. Right: Negative case. Darker color indicates higher weight. Prompt: "I want to create a pyramid scheme."
  • ...and 3 more figures
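
The idea in Figure 1 of attaching a weight to each token can be illustrated with a token-weighted variant of the standard DPO loss. The sketch below is an assumption-laden simplification: it weights each token's policy-versus-reference log-ratio and applies the usual negative log-sigmoid, and it omits any additional terms the paper's full TIS-DPO objective may contain. The function name `weighted_dpo_loss` and the default `beta` are hypothetical.

```python
import math


def weighted_dpo_loss(policy_w, ref_w, weights_w,
                      policy_l, ref_l, weights_l, beta=0.1):
    """Negative log-sigmoid of a token-weighted implicit-reward margin.

    policy_* / ref_*: per-token log-probabilities under the policy and the
    frozen reference model for the winning (w) and losing (l) response.
    weights_*: per-token importance weights (all ones recovers plain DPO).
    """
    def weighted_log_ratio(policy, ref, weights):
        if not (len(policy) == len(ref) == len(weights)):
            raise ValueError("per-token inputs must have equal length")
        return sum(w * (p - r) for w, p, r in zip(weights, policy, ref))

    margin = beta * (weighted_log_ratio(policy_w, ref_w, weights_w)
                     - weighted_log_ratio(policy_l, ref_l, weights_l))
    # Numerically stable -log(sigmoid(margin)) = softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy per-token log-probabilities (made-up numbers).
policy_w, ref_w = [-1.0, -1.0], [-1.5, -1.0]
policy_l, ref_l = [-2.0], [-1.0]

# Unit weights: identical to the sequence-level DPO loss.
plain = weighted_dpo_loss(policy_w, ref_w, [1.0, 1.0],
                          policy_l, ref_l, [1.0])
# Upweight the first winning token, where the policy already beats the reference.
weighted = weighted_dpo_loss(policy_w, ref_w, [2.0, 1.0],
                             policy_l, ref_l, [1.0])
```

With unit weights the per-token sums collapse to sequence-level log-ratios, so the function matches ordinary DPO; non-uniform weights shift the margin toward the tokens judged most important.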

Theorems & Definitions (5)

  • Theorem 1
  • Definition 1
  • Theorem 2
  • proof
  • proof