Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization

Jian Li, Shenglin Yin, Yujia Zhang, Alan Zhao, Xi Chen, Xiaohui Zhou, Pengfei Xu

TL;DR

Ambiguous token content within preference pairs can hinder Direct Preference Optimization (DPO). The authors introduce Ambiguity Awareness Optimization (AAO), a lightweight re-weighting scheme based on language-model embeddings that uses adaptive thresholds to down-weight ambiguous tokens and emphasize core preference signals. The paper provides theoretical motivation, a practical identification and weighting pipeline, and extensive experiments showing state-of-the-art improvements on AlpacaEval 2, Arena-Hard, MT-Bench, and safety benchmarks, with minimal overhead. Overall, AAO is a practical plug-in enhancement to DPO that mitigates the squeeze effect and improves alignment performance across model scales.

Abstract

Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness. We observe that identical or semantically similar content (defined as ambiguous content) frequently appears within preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we show that ambiguous content can introduce ambiguities and thereby degrade performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that computes semantic similarity within preference pairs and automatically re-weights ambiguous content to reduce ambiguities. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and by up to 15.0 points on Arena-Hard.

Paper Structure

This paper contains 15 sections, 17 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Diagram of our proposed AAO, in which background tokens among preference answers are re-weighted when computing cross-entropy loss. First, AAO tokenizes the preference pairs using a language model and encodes them into corresponding embeddings. Then, AAO calculates the semantic similarity of the embeddings using cosine distance and identifies the background tokens with an adaptive threshold. Finally, the background tokens are re-weighted during training.
  • Figure 2: Plots of different weighting curves. In our approach, the thresholds $a$ and $b$ are determined by the LLM itself during training.
  • Figure 3: Effect of auxiliary losses on experimental results.
  • Figure 4: AAO mitigates the squeeze effect of DPO.
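
The pipeline summarized in the Figure 1 caption (embed tokens, compare them across the preference pair, threshold, re-weight) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy two-dimensional embeddings, the use of the mean similarity as the adaptive threshold, and the `low_weight` value of 0.1 are all assumptions made for illustration; the paper's actual thresholds $a$ and $b$ and its weighting curves are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_similarity_per_token(own_embeddings, other_embeddings):
    """For each token in one response, its highest similarity to any token
    in the other response of the preference pair."""
    return [
        max(cosine_similarity(e, o) for o in other_embeddings)
        for e in own_embeddings
    ]


def ambiguity_weights(similarities, low_weight=0.1):
    """Down-weight tokens whose similarity exceeds a data-dependent threshold.

    Assumption: the threshold is the mean similarity over the (non-empty)
    sequence, and the weight is a hard step. Both are hypothetical stand-ins.
    """
    threshold = sum(similarities) / len(similarities)
    return [low_weight if s > threshold else 1.0 for s in similarities]


def weighted_sequence_logprob(token_logprobs, weights):
    """Token-weighted sum of log-probabilities for one response."""
    return sum(w * lp for w, lp in zip(weights, token_logprobs))


# Toy example: tokens 0 and 2 of the chosen response also occur in the
# rejected response, so they are treated as ambiguous (background) tokens.
chosen_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rejected_embeddings = [[1.0, 0.0], [0.0, -1.0]]
chosen_logprobs = [-1.0, -2.0, -3.0]

similarities = max_similarity_per_token(chosen_embeddings, rejected_embeddings)
weights = ambiguity_weights(similarities)  # [0.1, 1.0, 0.1]
weighted_logprob = weighted_sequence_logprob(chosen_logprobs, weights)
```

In a DPO-style objective, a weighted log-probability of this kind would presumably replace the unweighted sequence log-probability of each response, so that tokens shared by both answers contribute less to the preference margin. How AAO combines the weights with the loss in detail is specified in the paper, not in this sketch.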