Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du

TL;DR

Relative Attention-Driven Actively Reasoning (RADAR) is a training-free inference method that uses the intrinsic attention of MLLMs to guide progressive localization and fine-grained local reasoning at test time. It consistently improves RS-VQA performance and reduces both factual and logical hallucinations.

Abstract

Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR

Paper Structure (42 sections, 17 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Two grounding failures underlying RS-VQA hallucinations. Type 1 (Cannot find): model attention becomes diffuse and is distracted by irrelevant regions, resulting in missed target localization. Type 2 (Cannot see clearly): attention covers the correct region, but the visual evidence is too small or ambiguous for fine-grained recognition, leading to incorrect predictions.
  • Figure 2: Query-Conditioned Relative Attention (QCRA). Given an input image and two prompts: a task-focused query $Q$ (top) and a global-comprehension query $Q^{G}$ (bottom), we derive layer-wise attention relevance maps. At each layer, token-level attention is reshaped to the image grid and a relative attention matrix is computed by contrasting the task map against the global map to suppress query-irrelevant saliency. The most informative layers (top-$k$) are then aggregated to produce a final query-conditioned attention heatmap, which serves as the grounding signal for region selection and multi-scale evidence construction.
  • Figure 3: Qualitative examples of RADAR's QCRA and progressive evidence refinement. For each example, we visualize the QCRA heatmap generated by a where-oriented query on the full image (top) and a what-oriented query on the localized crop (middle), where brighter regions indicate stronger query-conditioned relevance. Dashed boxes mark the regions selected for zoom-in evidence extraction.
  • Figure 4: Example from human calibration for hallucination labeling.
  • Figure 5: Example from human calibration for hallucination labeling.
  • ...and 1 more figure
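
The QCRA pipeline described in the Figure 2 caption (reshape token-level attention to the image grid, contrast the task map against the global map, aggregate the top-$k$ layers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption does not state the contrast operator or the layer-selection criterion, so the sketch assumes an element-wise ratio and a peak-to-mean score. The function name `qcra_heatmap` and all of its parameters are hypothetical.

```python
import numpy as np


def qcra_heatmap(task_attn, global_attn, grid_hw, top_k=3, eps=1e-6):
    """Sketch of a query-conditioned relative attention heatmap.

    task_attn, global_attn: arrays of shape (num_layers, num_image_tokens)
        holding attention mass on image tokens for the task query Q and the
        global-comprehension query Q^G (how these are extracted from the
        model is outside this sketch).
    grid_hw: (height, width) of the image-token grid.
    """
    task_attn = np.asarray(task_attn, dtype=np.float64)
    global_attn = np.asarray(global_attn, dtype=np.float64)
    h, w = grid_hw
    num_layers = task_attn.shape[0]

    # Reshape token-level attention to the image grid, one map per layer.
    task_maps = task_attn.reshape(num_layers, h, w)
    global_maps = global_attn.reshape(num_layers, h, w)

    # ASSUMPTION: contrast by element-wise ratio, so regions that are salient
    # under both queries (query-irrelevant saliency) are pushed toward 1.
    relative = task_maps / (global_maps + eps)

    # ASSUMPTION: a layer is "informative" when its relative map is peaked,
    # measured here as peak value over mean value.
    peak = relative.max(axis=(1, 2))
    mean = relative.mean(axis=(1, 2))
    scores = peak / (mean + eps)

    # Keep the top-k layers and average their relative maps.
    top_layers = np.argsort(scores)[::-1][:top_k]
    heatmap = relative[top_layers].mean(axis=0)

    # Normalise to [0, 1] so the map can be thresholded for region selection.
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + eps)
    return heatmap, top_layers
```

Under these assumptions, a region that both queries attend to equally ends up near the bottom of the normalised map, while a region that only the task query attends to stands out. The resulting heatmap would then serve as the grounding signal for choosing the zoom-in crops shown in Figure 3.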