RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation

Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song, Mingliang Zhou, Weijia Jia

TL;DR

This work targets hallucinations in remote-sensing (RS) multimodal LLMs. It introduces an RS-specific taxonomy that adds image-level hallucinations to object-level ones; establishes RSHalluEval, a hallucination benchmark with dual-mode checking (online cloud auditing, plus offline local checking with a compact checker fine-tuned on RSHalluCheck); and builds RSHalluShield for fine-tuning-based mitigation, alongside two training-free remedies: decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, the mitigation improves the hallucination-free rate by up to $21.63$ percentage points while maintaining competitive RSVQA/RSVG performance. Together, the taxonomy, the evaluation protocol, and the training-based and training-free mitigations form a framework for more trustworthy RS-MLLMs in high-stakes deployments.

Abstract

Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark, RSHalluEval (2,023 QA pairs), and enable dual-mode checking, supporting high-precision cloud auditing and low-cost, reproducible local checking via a compact checker fine-tuned on the RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset, RSHalluShield (30k QA pairs), for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.

Paper Structure

This paper contains 41 sections, 7 equations, 15 figures, and 11 tables.

Figures (15)

  • Figure 1: Overview of our RSHallu pipeline for studying hallucinations in RS MLLMs. We (1) establish an RS-oriented definition and taxonomy (including image-level hallucination); (2) develop RSHalluEval with dual-mode checking: online evaluation and a locally deployable checker fine-tuned on RSHalluCheck; and (3) mitigate hallucinations via training-friendly fine-tuning on RSHalluShield and training-free plug-and-play strategies (decoding- and prompt-based).
  • Figure 2: Radar plot of hallucination-free (HF) rates for representative MLLMs on RSHalluEval. HF is averaged from expert scores (higher is better) over image-level categories ($HF_{IA}$, $HF_{IS}$) and object-level categories ($HF_{OE}$, $HF_{OA}$, $HF_{OR}$), corresponding to image attributes (IA), image scenes (IS), object existence (OE), object attributes (OA), and object relations (OR).
  • Figure 3: Examples of different types of object hallucinations. From left to right are hallucinations of object existence, object attributes, and object relations. We present the hallucinatory output texts, with hallucinatory parts marked in red. The correct outputs are shown below.
  • Figure 4: Examples of hallucinations triggered when RS MLLMs fail to comprehend image-level information. The hallucinatory parts are marked in red.
  • Figure 5: Examples of different types of image-level hallucinations. Among them, examples with orange borders belong to image attribute hallucinations, whereas those with blue borders belong to image scene hallucinations. The hallucinatory parts are shown in red.
  • ...and 10 more figures
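
The Figure 2 caption reports hallucination-free (HF) rates per category (IA, IS, OE, OA, OR), and the abstract reports gains in percentage points. The following is a minimal sketch of how such numbers could be tabulated. It is not the authors' scoring code: it assumes each response receives a binary judgment (1 = hallucination-free, 0 = hallucinated), whereas the paper's expert scores may use a different scale. The function names `hf_rates` and `pp_gain` and the toy judgments are hypothetical.

```python
from collections import defaultdict

# Category codes taken from the Figure 2 caption.
CATEGORIES = ("IA", "IS", "OE", "OA", "OR")


def hf_rates(judgments):
    """Return the per-category hallucination-free rate in percent.

    `judgments` is an iterable of (category, score) pairs. ASSUMPTION:
    score is 1 for a hallucination-free response and 0 otherwise.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for category, score in judgments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += score
        counts[category] += 1
    # Only report categories that actually received judgments.
    return {c: 100.0 * totals[c] / counts[c] for c in CATEGORIES if counts[c]}


def pp_gain(before, after):
    """Percentage-point gain per category: an absolute difference of rates."""
    return {c: after[c] - before[c] for c in before if c in after}


# Toy, made-up judgments purely for illustration (not data from the paper).
baseline = [("IA", 0), ("IA", 1), ("OE", 1), ("OE", 0), ("OE", 0), ("OE", 1)]
mitigated = [("IA", 1), ("IA", 1), ("OE", 1), ("OE", 1), ("OE", 0), ("OE", 1)]

base_rates = hf_rates(baseline)
new_rates = hf_rates(mitigated)
gains = pp_gain(base_rates, new_rates)
print(base_rates, new_rates, gains)
```

A percentage-point gain is the absolute difference between two rates, not a relative improvement: in the toy data, moving OE from 50% to 75% is a gain of 25 percentage points, even though it is a 50% relative increase.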