Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun, Rongrong Ji
TL;DR
The paper tackles multi-image understanding in multimodal LLMs, where cross-modal misalignment leads to hallucinations. It introduces Context-to-Cue Direct Preference Optimization (CcDPO), a two-level hierarchical DPO framework: Context-Level optimization relies on language-based captioning preferences, and Needle-Level optimization combines region-focused and vision-contrastive preferences. Both levels are supported by the automatically generated MultiScope-42k dataset. Empirical results across diverse multi-image benchmarks show reduced hallucinations and consistent gains on multi-image tasks and some single-image tasks, while ablations validate the benefits of the two-stage training and of large-scale, structured supervision. The approach offers a scalable, data-efficient path to improved cross-image reasoning and grounding in MLLMs, with potential extensions to temporal data and enhanced OCR grounding.
Abstract
Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. It features: (i) Context-Level Optimization: re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization: directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.
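
For readers unfamiliar with the objective the abstract builds on, the following is a minimal sketch of the standard per-pair DPO loss. It is not the CcDPO objective itself, whose context-level and needle-level terms are not specified in this section. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response,
    conditioned on the same prompt (here, the same multi-image input).
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference logit: positive when the policy favors the chosen
    # response more strongly than the reference model does.
    z = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

Under this formulation, a preference pair only changes what plays the role of "chosen" and "rejected": for a context-level pair these would presumably be two descriptions of the whole image sequence, and for a needle-level pair two responses about a prompted region, but that mapping is an assumption drawn from the abstract's wording rather than a stated detail.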
