OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images

Ziyue Huang, Yongchao Feng, Shuai Yang, Ziqi Liu, Qingjie Liu, Yunhong Wang

TL;DR

OpenRSD tackles the generalization gap in remote sensing object detection by introducing an open-prompt detector that operates with multimodal prompts and two specialized heads for fast alignment and deep fusion. A three-stage training pipeline and a large ORSD+ dataset enable strong cross-domain performance across seven public RS datasets, handling both oriented and horizontal bounding boxes with real-time inference (about 20.8 FPS). The method leverages SkyCLIP and DINOv2 prompts, offline prompt dictionaries, and class embeddings to balance vocabulary scalability with precision. Empirical results show OpenRSD outperforms state-of-the-art baselines in OBB tasks and remains competitive with high-precision methods in HBB tasks while offering substantial speed advantages, validating its practical utility for large-scale RS image analysis.

Abstract

Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, including oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of the model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7% higher average precision and achieves an inference speed of 20.8 FPS. Code and models will be released.

Paper Structure

This paper contains 16 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: We compare the precision and inference speed of OpenRSD against other methods, with inference speed tested on a single 2080Ti GPU.
  • Figure 2: OpenRSD consists of three components: multi-scale image feature encoding, prompt construction, and multi-task detection heads. The prompt construction randomly samples multiple text or image prompts for each category and encodes them into prompt embeddings. The multi-task detection heads leverage the correlation between prompt embeddings and image features to perform detection, seamlessly integrating fusion-based and alignment-based open-prompt structures within one framework. This design enables support for a wide range of prompt-based classification and regression tasks. The alignment head offers higher speed and greater vocabulary scalability, while the fusion head achieves better precision, allowing the model to adapt to different application scenarios. To enhance generalization, we further employ a multi-stage training pipeline.
  • Figure 3: The multi-stage training pipeline includes pretraining, fine-tuning, and self-training. Pretraining trains only the detection modules so that they adapt to the RS detection task. The fine-tuning stage enables the detector to detect arbitrary objects in RS images. Self-training enhances cross-scenario generalization.
  • Figure 4: Visualization results on the DOTA-v2.0 validation set before (top row) and after (bottom row) self-training, demonstrating four prompts: detecting any objects, detecting plane, detecting ship, and detecting buildings.
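
The Figure 2 caption describes prompt construction (randomly sampling several prompts per category) and an alignment head that relates prompt embeddings to image features. The sketch below is a minimal illustration of that idea, not the authors' implementation: the function names `sample_prompts` and `alignment_scores`, the use of cosine similarity, and taking the maximum over a category's sampled prompts are all assumptions made for this example, and the two-dimensional vectors are toy stand-ins for real encoder outputs.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length; return it unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sample_prompts(prompt_bank, k, rng):
    """Randomly pick up to k prompt embeddings per category (hypothetical helper)."""
    return {
        cat: rng.sample(embs, min(k, len(embs)))
        for cat, embs in prompt_bank.items()
    }


def alignment_scores(feature, sampled):
    """Score one image feature against every category.

    Assumption: the score is the cosine similarity to the best-matching
    sampled prompt of that category. The paper may aggregate differently.
    """
    f = l2_normalize(feature)
    scores = {}
    for cat, embs in sampled.items():
        scores[cat] = max(
            sum(a * b for a, b in zip(f, l2_normalize(e))) for e in embs
        )
    return scores


# Toy prompt bank: in practice these would be text or image embeddings.
rng = random.Random(0)
bank = {"plane": [[1.0, 0.0], [0.9, 0.1]], "ship": [[0.0, 1.0]]}
sampled = sample_prompts(bank, k=2, rng=rng)
scores = alignment_scores([0.8, 0.2], sampled)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the prompt embeddings in a design like this do not depend on the image, they can be computed once and reused across images, which is one plausible reading of why the caption credits the alignment head with higher speed and greater vocabulary scalability than the fusion head.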