Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng

TL;DR

This work tackles suboptimal preference optimization in Direct Preference Optimization (DPO) caused by uniform token weighting. It introduces OTPO, an unsupervised token weighting scheme based on unbalanced optimal transport that aligns tokens across chosen and rejected responses using last-layer representations, producing a semantically aware reward difference $\Delta_{\hat{r}}$. OTPO demonstrates improved instruction-following performance and reduced length bias across multiple backbones and tasks, achieving up to a $10.9\%$ gain in length-controlled win-rate on AlpacaEval2 and an $8.6\%$ improvement in TL;DR summarization. The approach emphasizes meaningful token interactions, offers interpretability via transport plans, and maintains practical efficiency with negligible overhead relative to transformer-based training.
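For orientation, the reward difference mentioned above can be written out. The first line is the standard DPO quantity with the sequence log-ratio decomposed over tokens. The second and third lines are only a sketch of a token-weighted variant: the weights $w^{c}_i$, $w^{r}_j$, the transport plan $\Gamma$, and the choice to read weights off the plan's marginals are notation assumed here for illustration, not taken from the paper.

$$
\begin{aligned}
\Delta_{r} &= \beta \sum_{i=1}^{|y_c|} \log \frac{\pi_\theta(y_{c,i} \mid x, y_{c,<i})}{\pi_{\mathrm{ref}}(y_{c,i} \mid x, y_{c,<i})} - \beta \sum_{j=1}^{|y_r|} \log \frac{\pi_\theta(y_{r,j} \mid x, y_{r,<j})}{\pi_{\mathrm{ref}}(y_{r,j} \mid x, y_{r,<j})} \\
\Delta_{\hat{r}} &= \beta \sum_{i=1}^{|y_c|} w^{c}_i \log \frac{\pi_\theta(y_{c,i} \mid x, y_{c,<i})}{\pi_{\mathrm{ref}}(y_{c,i} \mid x, y_{c,<i})} - \beta \sum_{j=1}^{|y_r|} w^{r}_j \log \frac{\pi_\theta(y_{r,j} \mid x, y_{r,<j})}{\pi_{\mathrm{ref}}(y_{r,j} \mid x, y_{r,<j})} \\
w^{c}_i &= \sum_{j} \Gamma_{ij}, \qquad w^{r}_j = \sum_{i} \Gamma_{ij}
\end{aligned}
$$

Setting every weight to 1 recovers the uniform weighting of standard DPO.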

Abstract

Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
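The weighting idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (see the linked repository for that). The function names, the cosine cost, the uniform token masses, the KL-relaxed Sinkhorn iteration, and the values of `eps`, `tau`, and `beta` are all assumptions made for the example.

```python
import numpy as np


def cosine_cost(h_chosen, h_rejected):
    """Pairwise cosine distance between token representations (assumed cost)."""
    a = h_chosen / np.linalg.norm(h_chosen, axis=1, keepdims=True)
    b = h_rejected / np.linalg.norm(h_rejected, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def unbalanced_sinkhorn(cost, eps=0.1, tau=1.0, n_iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals.

    Assumes uniform mass on the tokens of each response. `eps` is the
    entropic regularization and `tau` the marginal relaxation strength;
    both values are illustrative.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    power = tau / (tau + eps)  # damped exponent of the unbalanced update
    for _ in range(n_iters):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]  # transport plan, shape (n, m)


def token_weights(plan):
    """Read token weights off the plan's marginals (an assumed choice)."""
    return plan.sum(axis=1), plan.sum(axis=0)


def weighted_reward_difference(logratio_chosen, logratio_rejected,
                               w_chosen, w_rejected, beta=0.1):
    """Token-weighted DPO-style reward difference.

    `logratio_*` hold per-token log(pi_theta / pi_ref) values.
    """
    return beta * (w_chosen @ logratio_chosen - w_rejected @ logratio_rejected)


# Toy usage: the first token of each response shares a representation,
# the second tokens are unrelated (orthogonal).
h_c = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
h_r = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
plan = unbalanced_sinkhorn(cosine_cost(h_c, h_r))
w_c, w_r = token_weights(plan)
delta = weighted_reward_difference(np.array([1.0, 0.2]),
                                   np.array([0.3, 0.1]), w_c, w_r)
```

In the toy example the matched first tokens receive more weight than the unmatched second tokens, which is the qualitative behavior the abstract describes. Because the marginal constraints are relaxed, the plan does not have to transport all mass, so poorly matched tokens can be down-weighted instead of being forced onto a partner.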

Paper Structure

This paper contains 33 sections, 27 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: The uniform weighting in DPO leads to suboptimal alignment results, allowing less relevant signals to dominate. OTPO identifies the contextually similar parts in pairwise responses as targets and upweights target signals. $r(y_*)$ denotes the estimated reward under each method.
  • Figure 2: Overall framework. (a) We compute the token-level weighting scheme using optimal transport. Each response's distribution is made up of its tokens, represented as vectors in the LLM's representation space. The optimized transport plan is visualized using a Sankey diagram. (b) We decompose the DPO loss at the token level and apply the weighting scheme obtained in (a).
  • Figure 3: Weights assigned to the responses by different methods. Here, given the prompt "What is the capital of France?", the chosen response is "The capital of France is Paris, a major European city known for its art, culture, and history.", and the rejected response is "France’s big city? That’s Paris, I guess. The birthplace of Victor Hugo, Émile Zola, Charles Baudelaire."
  • Figure 4: TL;DR summarization win rates compared to the base model, using GPT-4o as the evaluator. OTPO exceeds the existing methods by a large margin.
  • Figure 5: Trend of gradient norm during training.
  • ...and 7 more figures