Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs

Suyang Xi; Chenxi Yang; Hong Ding; Yiqing Ni; Catherine C. Liu; Yunhao Liu; Chengqi Zhang

TL;DR

HuLiRAG is a cognitively inspired what–where–reweight cascade that grounds multimodal reasoning in region-level visual evidence, addressing hallucinations and rigid global grounding in MLLMs. A pre-stage retrieves candidate images; the What stage decomposes the query into phrases; the Where stage grounds those phrases to precise masks via GroundingDINO and SAM; and the Reweight stage adaptively fuses global and local cues with a learnable balance. Spatially aware fine-tuning further constrains generation, using mask-guided supervision to enforce grounding in the answer and improve factual consistency on WebQA and MultimodalQA. Results show substantial retrieval gains and improved VQA performance across backbones, with ablations confirming the necessity of each component and the value of adaptive fusion and spatial supervision for trustworthy multimodal reasoning.

Abstract

Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some of these errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it relies only on global image-level information and lacks local detail, which limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a "what–where–reweight" cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
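
The reweight step can be illustrated with a small sketch. The abstract does not state the fusion formula, so everything here is an assumption made for illustration: the fused score is taken to be a convex combination of a global (whole-image) similarity and a local (mask-level) similarity, the balance is a single scalar squashed through a sigmoid, and the names fuse_scores, rerank, and balance_logit are hypothetical. In a trained system the balance would be learned; here it is set by hand.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unconstrained scalar to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(global_sim: float, local_sim: float, balance_logit: float) -> float:
    """Assumed fusion rule: convex combination of global and local similarity.

    balance_logit is a hypothetical stand-in for a learnable parameter;
    sigmoid(balance_logit) is the weight placed on the global score.
    """
    alpha = sigmoid(balance_logit)
    return alpha * global_sim + (1.0 - alpha) * local_sim


def rerank(candidates, balance_logit: float = 0.0):
    """Sort (image_id, global_sim, local_sim) triples by fused score, best first."""
    scored = [
        (image_id, fuse_scores(g, l, balance_logit))
        for image_id, g, l in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up similarity values, purely for illustration.
candidates = [
    ("img_a", 0.80, 0.20),  # strong global match, weak region match
    ("img_b", 0.60, 0.90),  # moderate global match, strong region match
    ("img_c", 0.50, 0.40),
]

for image_id, score in rerank(candidates, balance_logit=0.0):
    print(image_id, round(score, 3))
```

With an equal balance (balance_logit = 0.0), the candidate with strong region-level evidence moves ahead of the one that only matches globally; pushing the balance toward the global score restores the purely global ordering.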
