ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai

TL;DR

ASPO addresses the limitations of binary, response-level preference optimization by introducing sentence-level adaptive rewards for multimodal reasoning. It computes per-sentence weights from image-text similarity via CLIP and textual perplexity, then integrates these into a modified DPO loss to achieve fine-grained alignment without extra models or data. Across multiple open-source MLLMs and benchmarks, ASPO yields consistent improvements and reduces hallucinations, demonstrating the value of granular supervision for multimodal preference optimization. The approach offers a practical and scalable enhancement to human-aligned multimodal systems with broad applicability in VQA, image captioning, and reasoning tasks.

Abstract

Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
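
The idea of weighting a DPO-style loss sentence by sentence can be illustrated with a minimal sketch. This summary does not reproduce the paper's equations, so everything below is an assumption made for illustration only: the linear blend of an image-text similarity score and a perplexity-derived score through `alpha`, the weighted sum of per-sentence log-probability ratios, and all function and parameter names are hypothetical and should not be read as the authors' formulation.

```python
import math

def sentence_weight(clip_sim, ppl_score, alpha=0.5):
    """Blend two per-sentence scores (assumed to lie in [0, 1]) into one weight.

    The linear blend is an illustrative assumption, not the paper's formula.
    """
    return alpha * clip_sim + (1.0 - alpha) * ppl_score

def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Weighted sum of per-sentence log-probability ratios (policy vs. reference)."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def sentence_level_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss where each sentence contributes with its own weight.

    `chosen` and `rejected` are (policy_logps, ref_logps, weights) tuples,
    each holding one entry per sentence of the response.
    """
    margin = beta * (weighted_log_ratio(*chosen) - weighted_log_ratio(*rejected))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical two-sentence responses with made-up scores and log-probabilities.
w_chosen = [sentence_weight(0.9, 0.7), sentence_weight(0.6, 0.8)]
w_rejected = [sentence_weight(0.8, 0.6), sentence_weight(0.2, 0.3)]
loss = sentence_level_dpo_loss(
    chosen=([-4.0, -5.0], [-4.5, -5.5], w_chosen),
    rejected=([-6.0, -7.0], [-5.5, -6.0], w_rejected),
)
print(f"sentence-weighted loss: {loss:.4f}")
```

With every weight set to 1, this sketch reduces to the standard response-level DPO loss, which makes the contrast drawn in the abstract concrete: the only change is how much each sentence contributes to the implicit reward.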

Paper Structure

This paper contains 15 sections, 13 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Comparison between DPO and our ASPO. Whereas the traditional DPO algorithm calculates implicit rewards at the response level, ASPO calculates adaptive rewards with sentences as the smallest unit, yielding a fine-grained, self-supervised preference optimization method.
  • Figure 2: Data collection pipeline.
  • Figure 3: Ablation study of the metric-weighting $\alpha$.