Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs

Suyang Xi, Chenxi Yang, Hong Ding, Yiqing Ni, Catherine C. Liu, Yunhao Liu, Chengqi Zhang

TL;DR

HuLiRAG presents a cognitively inspired what–where–reweight cascade that grounds multimodal reasoning in region-level visual evidence, addressing hallucinations and rigid global grounding in MLLMs. A pre-stage retrieves candidate images, the What stage decomposes queries into phrases, the Where stage grounds those phrases to precise masks via GroundingDINO and SAM, and the Reweight stage adaptively fuses global and local cues with a learnable balance. Spatially-aware fine-tuning further constrains generation, using mask-guided supervision to enforce grounding in the answer and improve factual consistency on WebQA and MultimodalQA. Results show substantial retrieval gains and improved VQA performance across backbones, with ablations confirming that each component is necessary and that adaptive fusion and spatial supervision both contribute to trustworthy multimodal reasoning.

Abstract

Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it relies only on global image-level information and lacks local detail, which limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a "what–where–reweight" cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
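
To make the reweight step concrete, the following is a minimal sketch of how a global image-level similarity and region-level (mask) similarities might be blended with a single learnable balance. The exact scoring rule is not given in this summary, so everything here is an assumption for illustration: the names `fused_score` and `rerank`, the sigmoid gate over a scalar `balance_logit`, and taking the maximum over region similarities are hypothetical choices, not the authors' formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unconstrained scalar to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fused_score(global_sim: float, local_sims: list[float], balance_logit: float) -> float:
    """Blend a global similarity with the best region-level similarity.

    Assumption: the balance is one scalar squashed by a sigmoid, and the
    local cue is the maximum similarity over grounded regions.
    """
    if not local_sims:
        # No region was grounded for this image: fall back to the global cue.
        return global_sim
    alpha = sigmoid(balance_logit)  # weight on the global cue
    return alpha * global_sim + (1.0 - alpha) * max(local_sims)


def rerank(candidates, balance_logit: float = 0.0):
    """Sort (name, global_sim, local_sims) candidates by fused score, best first."""
    scored = [
        (name, fused_score(g, locs, balance_logit))
        for name, g, locs in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical similarities: img_a matches globally, img_b matches in one region.
    candidates = [
        ("img_a", 0.80, [0.20, 0.30]),
        ("img_b", 0.70, [0.90]),
    ]
    print(rerank(candidates, balance_logit=0.0))   # equal weight on both cues
    print(rerank(candidates, balance_logit=10.0))  # almost entirely global
```

With equal weighting, the image whose grounded region matches the query well outranks the image that only matches globally; pushing the balance toward the global cue reverses the order. In a trained system the balance would be learned rather than set by hand.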

Paper Structure

This paper contains 17 sections, 11 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Standard MLLMs struggle with factual VQA due to inadequate perceptual grounding. Our method equips LLMs with the ability to ‘read’ images by dynamically retrieving and aligning semantically relevant visual regions, enabling evidence-based reasoning.
  • Figure 2: Overview of the HuLiRAG framework. Our method implements a human-like staged retrieval pipeline: (1) global feature matching for candidate shortlisting, (2) query-grounded region refinement via detection and segmentation, and (3) adaptive fusion of global and local cues through a learned weighting scheme. The refined evidence feeds into a grounded VQA module, where Alpha-CLIP jointly optimizes global–local alignment.
  • Figure 3: Evaluating LLaVA-Next (7B/8B/13B) with CLIP$\rightarrow$VQA, HuLiRAG, and Oracle under the LLM-as-a-Judge protocol.
  • Figure 4: Comparison between CLIP’s global attention (query-agnostic) and HuLiRAG’s What-Where-Reweight mechanism (query-conditioned). HuLiRAG mimics human perception by adaptively fusing global context with masked local regions. The HuLiRAG heatmaps are obtained by overlaying regional relevance on top of CLIP’s original activation, showing how adaptive fusion sharpens attention to query-relevant evidence.
  • Figure 5: Evaluation by business LLM