VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Chao Pang; Xingxing Weng; Jiang Wu; Jiayu Li; Yi Liu; Jiaxing Sun; Weijia Li; Shuai Wang; Litong Feng; Gui-Song Xia; Conghui He

TL;DR

This work addresses the need for versatile and truthful remote sensing (RS) vision-language understanding by introducing two datasets: VersaD, an RS image-text dataset with rich-content captions, and HnstD, an honest instruction dataset containing both factual and deceptive queries. It presents VHM, a three-component vision language model (VLM) built on multi-level RS visual representations and trained with a two-stage pipeline. VHM performs strongly on scene classification, visual question answering (VQA), and visual grounding, and it answers questions honestly. The approach extends RS capabilities to tasks such as building vectorizing and multi-label classification, and shows that rich, domain-specific captions combined with honesty cues improve robustness and generalization. By releasing the datasets and models, the work offers a practical path toward trustworthy RS AI systems and points to future potential in segmentation and change detection.

Abstract

This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, unlike existing remote sensing instruction datasets that include only factual questions, HnstD contains additional deceptive questions that ask about objects not present in the image. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on the common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification, and honest question answering. We will release the code, data, and model weights at https://github.com/opendatalab/VHM.
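To make the distinction between factual and deceptive questions concrete, the following is a minimal sketch, not the authors' HnstD construction procedure. It assumes each image comes with a list of object categories that are actually present, plus a fixed category vocabulary. The names `QAPair` and `make_honesty_pairs` and the question templates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class QAPair:
    """One instruction sample (hypothetical format, not the HnstD schema)."""
    question: str
    answer: str
    deceptive: bool  # True when the question presupposes an absent object


def make_honesty_pairs(present: Sequence[str],
                       vocabulary: Sequence[str]) -> List[QAPair]:
    """Build one question per vocabulary category.

    Categories present in the image get a factual question. Absent
    categories get a deceptive question that presupposes the object
    exists; the honest answer rejects that presupposition.
    """
    present_set = set(present)
    pairs = []
    for category in vocabulary:
        if category in present_set:
            # Factual question about an object that exists in the image.
            pairs.append(QAPair(
                question=f"Is there any {category} in the image?",
                answer="Yes.",
                deceptive=False,
            ))
        else:
            # Deceptive question: asks for an attribute of a missing object.
            pairs.append(QAPair(
                question=f"What color is the {category} in the image?",
                answer=f"There is no {category} in the image.",
                deceptive=True,
            ))
    return pairs


if __name__ == "__main__":
    for pair in make_honesty_pairs(["airplane"], ["airplane", "ship"]):
        print(pair)
```

A model fine-tuned only on the factual branch never sees a sample whose correct answer denies the premise of the question, which is the gap the deceptive branch is meant to illustrate.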

Paper Structure (26 sections, 2 equations, 17 figures, 13 tables)

Introduction
Versatile and Honest Datasets
VersaD
HnstD
Versatile and Honest VLM
Model Architecture
Training Strategy
Experiments
Datasets
Evaluation on Versatility
Evaluation on Honesty
Ablation Studies
Conclusion
Appendix Overview
More Details about VersaD
...and 11 more sections

Figures (17)

Figure 1: Illustration of versatility and honesty. In (a), words in red and bold are key pieces of information in the captions. Existing datasets for pretraining VLMs typically contain sparse-content captions that focus on a few prominent objects and their relationships. In contrast, VersaD captions provide detailed descriptions of image properties, object attributes, and scene context. These rich-content captions contribute to a more thorough understanding of RS images, thereby enhancing the ability of VLMs to perform diverse RS tasks. Additionally, instruction datasets for fine-tuning VLMs usually contain only factual questions about objects that exist in the images (see words in orange in (a)), which can lead VLMs to give false affirmative answers to nonsense queries about non-existent objects. In contrast, our HnstD includes both factual and deceptive questions, designed to instill honesty in VLMs.
Figure 2: Prompts for generating rich-content captions.
Figure 3: Samples in the HnstD dataset.
Figure 4: Architecture of the proposed VHM.
Figure 5: Conversations between users and VHM.
...and 12 more figures
