ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu

Abstract

While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
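
The asymmetry described in the abstract can be illustrated with a minimal scalar sketch. It assumes the standard DPO implicit reward, $r(y) = \beta(\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x))$, and a hypothetical coefficient `alpha` that multiplies only the rejected reward. The abstract does not say how ACPO computes its complexity-aware coefficient, so `alpha` is a plain input here and is treated as a constant when taking gradients. The function names are illustrative and not taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Standard DPO implicit reward for one response (sequence log-probs)."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen, r_rejected):
    """Standard DPO loss for one preference pair."""
    return neg_log_sigmoid(r_chosen - r_rejected)


def asymmetric_loss(r_chosen, r_rejected, alpha):
    """Assumed asymmetric variant: only the rejected reward is scaled.

    `alpha` is a hypothetical stand-in for ACPO's coefficient, whose
    actual definition is not given in this excerpt.
    """
    return neg_log_sigmoid(r_chosen - alpha * r_rejected)


def asymmetric_loss_grads(r_chosen, r_rejected, alpha):
    """Gradients of asymmetric_loss w.r.t. (r_chosen, r_rejected).

    alpha is held constant (an assumption of this sketch).
    """
    s = sigmoid(-(r_chosen - alpha * r_rejected))
    return -s, alpha * s
```

With `alpha = 1` the sketch reduces to standard DPO, where the two gradients have equal magnitude. With `alpha < 1` the gradient on the rejected reward is smaller in magnitude than the gradient on the chosen reward, which is one way to read "asymmetrically suppressing the gradient flow on the rejected term".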

Paper Structure (19 sections, 15 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Motivation and Overview. (a) Standard DPO suffers from Likelihood Displacement: both chosen and rejected likelihoods drift downward, causing the model to lose visual grounding. (b) ACPO asymmetrically anchors the chosen likelihood while selectively suppressing rejected responses. (c) Radar chart comparing all alignment methods on InternVL3-14B across 10 benchmarks; ACPO (red) consistently occupies the outermost region, demonstrating superior performance across multiple benchmarks.
  • Figure 2: Evolution of Implicit Rewards During Optimization. Left: Standard DPO tends to satisfy the preference margin by jointly decreasing the implicit rewards of both chosen and rejected distributions, exhibiting a drift in absolute reward levels. Right: ACPO breaks this symmetric coupling: the chosen reward $r(y_w)$ remains stable as an anchor, while the rejected reward $r(y_l)$ absorbs most of the optimization pressure.
  • Figure 3: Cross-Method Comparison of Training Dynamics. (a) Relative change in chosen reward from initial values. ACPO achieves the highest and most stable chosen reward gain ($\sim$+8.5), while standard DPO exhibits a pronounced drop after step 1000, consistent with Likelihood Displacement. (b) Margin evolution. All DPO-based variants converge to comparable margins ($\sim$27), indicating that ACPO improves chosen-reward preservation without sacrificing discriminative separation. SimPO is excluded due to its different reward scale (no reference model).
  • Figure 4: Head-to-Head Preference Evaluation. Pairwise win rates of ACPO against baseline methods, evaluated by Gemini. Using human-annotated chosen responses as the gold standard reference, ACPO's outputs demonstrate significantly higher semantic alignment and fewer hallucinations compared to symmetric alignment methods.
  • Figure 5: Global Cross-Attention Distribution and Quantitative Tracking. (Left) Heatmap visualizations of global attention averaged across all generated tokens. While visual anchor collapse is episodic, standard DPO is highly prone to it during long-context generation, causing attention to scatter to preceding text. ACPO maintains dense global anchoring on key visual subjects. (Right) Evolution of cumulative image token attention weights over generation steps, demonstrating that ACPO successfully arrests attention decay and maintains a significant advantage in visual grounding.
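
The two quantities tracked in Figures 2 and 3, the change in chosen reward and the chosen-rejected margin, can be computed from logged implicit rewards roughly as follows. This is a sketch under assumptions: "relative change" is read as the difference from the first logged value, the displacement check is a simple heuristic rather than the paper's criterion, and the function names and toy numbers are illustrative, not results from the paper.

```python
def reward_dynamics(chosen_rewards, rejected_rewards):
    """Return (chosen_change, margin) per logged step.

    chosen_change[t] = r(y_w) at step t minus r(y_w) at the first step
    margin[t]        = r(y_w) - r(y_l) at step t
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward logs must have the same length")
    if not chosen_rewards:
        return [], []
    base = chosen_rewards[0]
    chosen_change = [r - base for r in chosen_rewards]
    margin = [w - l for w, l in zip(chosen_rewards, rejected_rewards)]
    return chosen_change, margin


def shows_displacement(chosen_rewards, rejected_rewards):
    """Heuristic flag (an assumption of this sketch): the margin grew
    over training while the chosen reward ended below its start."""
    change, margin = reward_dynamics(chosen_rewards, rejected_rewards)
    if not change:
        return False
    return margin[-1] > margin[0] and change[-1] < 0


# Toy logs, made up for illustration only.
drifting = shows_displacement([0.0, -1.0, -2.0], [0.0, -3.0, -6.0])
anchored = shows_displacement([0.0, 0.5, 1.0], [0.0, -2.0, -4.0])
```

In the first toy log the margin widens only because the rejected reward falls faster than the chosen one, which is the pattern the captions attribute to standard DPO. In the second the margin widens while the chosen reward holds or rises.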