Table of Contents
Fetching ...

Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method

Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, Weiming Zeng

TL;DR

The paper tackles the mismatch between remote sensing imagery and vision-language systems by constructing instruction-tuning datasets that translate RS annotations into JSON-based natural language prompts, enabling object detection in RS without altering the VLM architecture. It fine-tunes a pre-trained vision-language model using LoRA with embedding noise, exploring rank configurations to identify an efficient, effective setup (notably rank 32) that yields competitive Precision, Recall, and F1 at IoU 0.5. The approach demonstrates that RS object detection can be achieved with language guidance and that the model retains VQA and scene-description capabilities, showcasing end-to-end RS understanding in a unified framework. This work paves the way for language-driven RS tasks and interactive analyses, while highlighting avenues for extending to visual grounding and counting, and assessing real-world inference efficiency.

Abstract

Recently, large language models (LLMs) and vision-language models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often leads to unsatisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we constructed supervised fine-tuning (SFT) datasets using publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10. In these new datasets, we converted annotation information into JSON-compliant natural language descriptions, facilitating more effective understanding and training for the VLM. We then evaluate the detection performance of various fine-tuning strategies for VLMs and derive optimized model weights for object detection in remote sensing images. Finally, we evaluate the model's prior knowledge capabilities using natural language queries. Experimental results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain vision question answering (VQA) tasks. Our datasets and related code will be released soon.

Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method

TL;DR

The paper tackles the mismatch between remote sensing imagery and vision-language systems by constructing instruction-tuning datasets that translate RS annotations into JSON-based natural language prompts, enabling object detection in RS without altering the VLM architecture. It fine-tunes a pre-trained vision-language model using LoRA with embedding noise, exploring rank configurations to identify an efficient, effective setup (notably rank 32) that yields competitive Precision, Recall, and F1 at IoU 0.5. The approach demonstrates that RS object detection can be achieved with language guidance and that the model retains VQA and scene-description capabilities, showcasing end-to-end RS understanding in a unified framework. This work paves the way for language-driven RS tasks and interactive analyses, while highlighting avenues for extending to visual grounding and counting, and assessing real-world inference efficiency.

Abstract

Recently, large language models (LLMs) and vision-language models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often leads to unsatisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we constructed supervised fine-tuning (SFT) datasets using publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10. In these new datasets, we converted annotation information into JSON-compliant natural language descriptions, facilitating more effective understanding and training for the VLM. We then evaluate the detection performance of various fine-tuning strategies for VLMs and derive optimized model weights for object detection in remote sensing images. Finally, we evaluate the model's prior knowledge capabilities using natural language queries. Experimental results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain vision question answering (VQA) tasks. Our datasets and related code will be released soon.

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Model and dataset format diagram. (a) Model architecture of Qwen2.5-vl. (b) Datasets Compare.
  • Figure 2: Lora Fine Tuning.
  • Figure 3: VQA Dialogue.