Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
TL;DR
This work tackles suboptimal preference optimization in Direct Preference Optimization (DPO) caused by uniform token weighting. It introduces OTPO, an unsupervised token weighting scheme based on unbalanced optimal transport that aligns tokens across chosen and rejected responses using last-layer representations, producing a semantically aware reward difference $\Delta_{\hat{r}}$. OTPO demonstrates improved instruction-following performance and reduced length bias across multiple backbones and tasks, achieving up to a $10.9\%$ gain in length-controlled win-rate on AlpacaEval2 and an $8.6\%$ improvement in TL;DR summarization. The approach emphasizes meaningful token interactions, offers interpretability via transport plans, and maintains practical efficiency with negligible overhead relative to transformer-based training.
Abstract
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to every token in a response, whereas humans focus on the more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence the DPO loss. To address this limitation, we propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments validate OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
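The token weighting idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that), and several choices here are assumptions made only for illustration: a cosine-distance cost between last-layer token representations, uniform initial token masses, a generic entropic unbalanced Sinkhorn solver with KL-relaxed marginals, and token weights taken as the row and column sums of the transport plan. The function names `unbalanced_sinkhorn`, `token_weights`, and `weighted_reward_difference` and the parameters `eps` and `tau` are hypothetical.

```python
import numpy as np


def unbalanced_sinkhorn(cost, a, b, eps=0.1, tau=1.0, n_iters=200):
    """Generic entropic unbalanced OT with KL-relaxed marginals.

    eps is the entropic regularization and tau the marginal relaxation
    strength (both illustrative defaults, not values from the paper).
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = tau / (tau + eps)            # exponent < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)


def token_weights(h_chosen, h_rejected, **ot_kwargs):
    """Derive per-token weights from a transport plan between two responses.

    h_chosen: (n, d) and h_rejected: (m, d) last-layer token representations.
    Assumption: cost is cosine distance and initial masses are uniform.
    """
    hc = h_chosen / np.linalg.norm(h_chosen, axis=1, keepdims=True)
    hr = h_rejected / np.linalg.norm(h_rejected, axis=1, keepdims=True)
    cost = 1.0 - hc @ hr.T               # values in [0, 2]
    n, m = cost.shape
    plan = unbalanced_sinkhorn(cost, np.full(n, 1.0 / n), np.full(m, 1.0 / m), **ot_kwargs)
    # Assumption: a token's weight is the total mass it sends or receives.
    return plan.sum(axis=1), plan.sum(axis=0), plan


def weighted_reward_difference(r_chosen, r_rejected, w_chosen, w_rejected):
    """Weighted sum of per-token rewards for chosen minus rejected tokens."""
    return float(w_chosen @ r_chosen - w_rejected @ r_rejected)


# Toy usage with random stand-ins for hidden states and per-token rewards.
rng = np.random.default_rng(0)
h_c, h_r = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
r_c, r_r = rng.normal(size=5), rng.normal(size=7)
w_c, w_r, plan = token_weights(h_c, h_r)
delta = weighted_reward_difference(r_c, r_r, w_c, w_r)
```

In a training setup, `r_c` and `r_r` would be per-token log-probability ratios between the policy and the reference model, and `delta` would replace the uniformly weighted reward difference inside the DPO loss. The `plan` matrix can also be inspected directly to see which chosen and rejected tokens were matched.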
