Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community

Jiancheng Pan, Yanxing Liu, Yuqian Fu, Muyuan Ma, Jiahao Li, Danda Pani Paudel, Luc Van Gool, Xiaomeng Huang

TL;DR

This work addresses the gap in open-vocabulary object detection for remote sensing by formulating Locate Anything on Earth (LAE) and delivering two core contributions: the LAE-Label Engine for assembling a large-scale, diverse RS dataset (LAE-1M) and the LAE-DINO open-vocabulary detector that introduces Dynamic Vocabulary Construction (DVC) and Visual-Guided Text Prompt Learning (VisGT). The LAE-1M dataset unifies fine- and coarse-grained RS annotations from multiple sources, enabling robust open-vocabulary pretraining, while LAE-DINO leverages a DINO-based architecture with vocabulary-aware prompts and scene-guided semantic alignment to handle thousands of potential categories. Extensive experiments on DIOR, DOTAv2.0, and the new LAE-80C benchmark show clear advantages of LAE-1M for open-set RS detection and of LAE-DINO for both open-set and closed-set tasks, with notable improvements from VisGT and LAE-COD integration. Overall, the work provides a practical pathway to deploy foundation-scale RS detectors capable of locating novel Earth concepts, with implications for environmental monitoring and land-use analysis.

Abstract

Object detection, particularly open-vocabulary object detection, plays a crucial role in Earth sciences, in applications such as environmental monitoring, natural disaster assessment, and land-use planning. However, existing open-vocabulary detectors, primarily trained on natural-world images, struggle to generalize to remote sensing images due to a significant data domain gap. This paper therefore aims to advance open-vocabulary object detection in the remote sensing community. To achieve this, we first reformulate the task as Locate Anything on Earth (LAE), with the goal of detecting any novel concept on Earth. We then develop the LAE-Label Engine, which collects, auto-annotates, and unifies up to 10 remote sensing datasets to create LAE-1M, the first large-scale remote sensing object detection dataset with broad category coverage. Using LAE-1M, we further propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task, featuring Dynamic Vocabulary Construction (DVC) and Visual-Guided Text Prompt Learning (VisGT) modules. DVC dynamically constructs a vocabulary for each training batch, while VisGT maps visual features to the semantic space, enhancing text features. We conduct comprehensive experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.
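
The abstract says only that DVC builds a vocabulary for each training batch. As a rough illustration of what that could look like, the sketch below assumes that the batch vocabulary consists of every category annotated in the batch (positives), padded with randomly sampled categories from the full label set (negatives) up to a fixed length. The function name `build_batch_vocabulary`, the `vocab_size` parameter, and the sampling rule are all hypothetical and are not taken from the paper.

```python
import random


def build_batch_vocabulary(batch_labels, global_vocabulary, vocab_size, rng=None):
    """Assemble a fixed-length vocabulary for one training batch.

    Assumed scheme (not confirmed by the source): keep every category that
    appears in the batch, then pad with random negatives from the global set.
    """
    rng = rng or random.Random()

    # Positives: categories annotated in this batch, de-duplicated in first-seen order.
    positives = list(dict.fromkeys(
        label for image_labels in batch_labels for label in image_labels
    ))
    if len(positives) > vocab_size:
        raise ValueError("vocab_size is smaller than the number of batch categories")

    # Negatives: categories from the global vocabulary that are absent from the batch.
    positive_set = set(positives)
    negative_pool = [c for c in global_vocabulary if c not in positive_set]
    num_negatives = min(vocab_size - len(positives), len(negative_pool))
    negatives = rng.sample(negative_pool, num_negatives)

    return positives + negatives


# Toy usage with made-up category names.
global_vocabulary = ["airplane", "vehicle", "ship", "harbor", "bridge", "stadium", "windmill"]
batch_labels = [["airplane", "vehicle"], ["ship", "airplane"]]
batch_vocabulary = build_batch_vocabulary(
    batch_labels, global_vocabulary, vocab_size=5, rng=random.Random(0)
)
```

Under this assumption the text encoder sees a short, fixed-length prompt list per batch rather than the full category set, which is one plausible way to keep training tractable when the label space grows to thousands of categories.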

Paper Structure (44 sections, 7 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Locate Anything on Earth (LAE) aims to detect any object on Earth and facilitate practical detection tasks, powered by LAE-Label Engine and LAE-DINO Model.
  • Figure 2: The pipeline of our LAE-Label Engine.
  • Figure 3: The pipeline for LAE-DINO.
  • Figure 4: VisGT maps visual features into semantic space. The scene features are instance-level, category-related features derived from the different textual features in an image, and they represent the image's scene-level information. For example, airplane and vehicle both belong to the airport scene.
  • Figure 5: The pipeline of our LAE-Label Engine.
  • ...and 6 more figures