Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun, Rongrong Ji

TL;DR

The paper tackles multi-image understanding in multimodal LLMs, where cross-modal misalignment leads to hallucinations. It introduces Context-to-Cue Direct Preference Optimization (CcDPO), a two-level hierarchical DPO framework: Context-Level optimization based on language-based captioning, and Needle-Level optimization that combines region-focused and vision-contrastive preferences. Both levels are supported by the automatically generated MultiScope-42k dataset. Empirical results across diverse multi-image benchmarks show reduced hallucinations and consistent gains on multi-image tasks and some single-image tasks, while ablations validate the benefits of the two-stage training and large-scale, structured supervision. The approach offers a scalable, data-efficient path to improved cross-image reasoning and grounding in MLLMs, with potential extensions to temporal data and enhanced OCR grounding.

Abstract

Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. It features: (i) Context-Level Optimization: Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization: Directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.
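
As background for the abstract, the following is a minimal sketch of the standard per-pair DPO loss that preference-optimization methods of this kind build on. It is not the CcDPO objective itself, which this page does not reproduce. The function names, the scalar log-probability inputs, and the default `beta` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response under
    the trainable policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # The policy prefers the chosen response more strongly than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -5.0, -2.0, -3.0))
```

The loss shrinks as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model.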

Paper Structure

This paper contains 21 sections, 3 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: (a) Prior multi-image DPO (e.g., MIA-DPO) is constrained by its reliance on predefined image references and text-only preferences, limiting holistic context modeling. (b) These limitations commonly lead to failures such as Context Omission (ignoring relevant images), Context Conflation (misattributing content across images), and Detail Misinterpretation (misrepresenting fine-grained visual cues). (c) CcDPO addresses these issues by hierarchically enhancing MLLMs' visual perception, from overall multi-image contexts to specific fine-grained details. (d) Benchmark comparisons demonstrate CcDPO's improved reasoning capabilities on both multi-image and single-image tasks.
  • Figure 2: (a) Baseline: Direct inference without context as a condition. (b) Two-stage approach: Generating image captions, then reasoning over them. (c) Performance: Accurate caption understanding as context substantially improves VQA accuracy, with noisy captions also proving beneficial. This highlights deficient intrinsic captioning in MLLMs as a key bottleneck, motivating its enhancement.
  • Figure 3: Overview of CcDPO. (a) Caption pools are built from LLaVA-23K [liu2023visual], MDVP [lin2024draw], and MVC [wu2025symmetrical] for image- and region-level supervision. (b) Context-Level DPO aligns model outputs with complete, coherent image sequences and penalizes omissions, conflation, and misalignments. (c) Needle-Level DPO incorporates visual prompts to enhance local detail understanding. Chosen responses describe the marked regions accurately, while rejected responses are drawn from mismatched regions. Both language-based and vision-contrastive preferences are used to sharpen fine-grained perception.
  • Figure 4: Token length distributions of chosen and rejected responses in our MultiScope-42k and MIA-DPO [mia-dpo]. MultiScope-42k exhibits significantly longer and more diverse answers, while MIA-DPO responses remain short and concentrated, indicating a simpler response pattern.
  • Figure 5: Word cloud comparison between our MultiScope-42k dataset and MIA-DPO [mia-dpo].
  • ...and 6 more figures
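
The Needle-Level pairing described in the Figure 3(c) caption (chosen responses describe the marked region, rejected responses come from mismatched regions) can be sketched roughly as follows. This is an illustrative assumption about how such pairs might be assembled, not the authors' pipeline: `RegionCaption`, `build_needle_pairs`, the prompt wording, and the box format are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionCaption:
    """Hypothetical record: one captioned region in one image of a sequence."""
    image_index: int   # position of the image in the multi-image input
    box: tuple         # (x1, y1, x2, y2); the format is an assumption
    caption: str


def build_needle_pairs(regions, seed=0):
    """Pair each marked region's caption with a caption from a mismatched region."""
    rng = random.Random(seed)  # seeded so the pairing is reproducible
    pairs = []
    for target in regions:
        # Candidate negatives: any other region whose caption differs.
        candidates = [
            r for r in regions
            if r is not target and r.caption != target.caption
        ]
        if not candidates:
            continue  # no valid mismatched region for this target
        distractor = rng.choice(candidates)
        pairs.append({
            "prompt": f"Describe the marked region in image {target.image_index + 1}.",
            "image_index": target.image_index,
            "box": target.box,
            "chosen": target.caption,        # accurate description
            "rejected": distractor.caption,  # description of a different region
        })
    return pairs


if __name__ == "__main__":
    demo = [
        RegionCaption(0, (10, 10, 50, 50), "a red mug on a desk"),
        RegionCaption(1, (0, 20, 80, 90), "a dog lying on grass"),
        RegionCaption(2, (5, 5, 40, 60), "a street sign at night"),
    ]
    for pair in build_needle_pairs(demo):
        print(pair["prompt"], "|", pair["chosen"], "|", pair["rejected"])
```

Drawing negatives from other regions of the same input keeps the rejected text plausible while making it wrong for the marked region, which is the property the caption describes.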