RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation
Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song, Mingliang Zhou, Weijia Jia
TL;DR
This work targets hallucinations in remote-sensing (RS) multimodal LLMs by introducing an RS-specific taxonomy that adds image-level hallucinations (e.g., errors in modality, resolution, and scene-level semantics) to object-centric ones. It builds RSHalluEval, a hallucination benchmark with dual-mode checking: high-precision cloud auditing, and low-cost, reproducible local checking with a compact checker fine-tuned on the RSHalluCheck dataset. For mitigation, it contributes the RSHalluShield dataset for training-friendly mitigation, plus two training-free, plug-and-play strategies: decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, the mitigation improves the hallucination-free rate by up to 21.63 percentage points while maintaining competitive performance on RSVQA and RSVG.
Abstract
Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, i.e., responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark, RSHalluEval (2,023 QA pairs), and enable dual-mode checking, supporting high-precision cloud auditing and low-cost, reproducible local checking via a compact checker fine-tuned on the RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset, RSHalluShield (30k QA pairs), for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
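To make the headline number concrete, the sketch below shows how a gain in percentage points is computed from per-response checker verdicts. It is a minimal illustration and not the paper's evaluation code: it assumes the hallucination-free rate is the share of responses that a checker labels as free of hallucination, and the function names and toy verdicts are hypothetical.

```python
from typing import Sequence


def hallucination_free_rate(verdicts: Sequence[bool]) -> float:
    """Percentage of responses judged hallucination-free.

    Assumption: each verdict is True when the checker found no
    hallucination in the corresponding response.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)


def percentage_point_gain(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Absolute difference between two rates, in percentage points."""
    return hallucination_free_rate(after) - hallucination_free_rate(before)


# Toy verdicts for illustration only; these are not results from the paper.
baseline = [True, False, False, True, False, True, False, False]  # 3 of 8
mitigated = [True, True, False, True, False, True, True, False]   # 5 of 8

gain = percentage_point_gain(baseline, mitigated)
print(f"gain: {gain:.2f} percentage points")
```

A percentage-point gain is an absolute difference between two rates, so in this toy case a move from 37.5% to 62.5% is a gain of 25 percentage points, not a 25% relative improvement.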
