Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua

TL;DR

This work tackles natural language-guided drone geolocalization by addressing two obstacles: scarce multi-modal data and fine-grained language-vision alignment. It introduces GeoText-1652, a three-platform extension of University-1652 with image-text-bbox annotations totaling 276,045 text-bbox pairs and 316,335 descriptions, enabling two tasks: text-guided drone navigation and drone-view target localization. A cross-modal framework built on the XVLM backbone combines image-text semantic matching with a novel blending spatial matching objective that reasons about region-level spatial relations. The total loss is $\mathcal{L}_{total}=\mathcal{L}_{itc}+\mathcal{L}_{itm}+\lambda(\mathcal{L}_{grounding}+\mathcal{L}_{spatial})$, where $\lambda=0.1$ weights the grounding and spatial terms. Experiments show improved Recall@K over baselines and robustness to unseen scenes, indicating strong potential for accurate, language-driven drone control in real-world environments.
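The weighting in the total loss can be illustrated with a minimal sketch. The function name `total_loss` and the example values are hypothetical; only the formula and the default $\lambda=0.1$ come from the summary above. The inputs are plain floats here, though the same arithmetic would apply to scalar tensors in a training framework.

```python
def total_loss(l_itc, l_itm, l_grounding, l_spatial, lam=0.1):
    """Combine the four loss terms as L_itc + L_itm + lam * (L_grounding + L_spatial).

    All argument names are illustrative; lam=0.1 follows the value quoted in the TL;DR.
    """
    # The two image-text matching terms enter at full weight,
    # while the grounding and spatial terms are scaled down by lam.
    return l_itc + l_itm + lam * (l_grounding + l_spatial)


# Made-up per-batch loss values, used only to show the weighting.
example = total_loss(l_itc=1.2, l_itm=0.8, l_grounding=2.0, l_spatial=1.0)
print(round(example, 4))
```

With $\lambda=0.1$, the grounding and spatial terms contribute one tenth of their raw value, so the image-text matching terms dominate the objective unless the spatial losses are large.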

Abstract

Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective, called blending spatial matching, that leverages fine-grained spatial associations for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
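For readers unfamiliar with the recall rate used in retrieval benchmarks, the sketch below shows one common definition of Recall@K: the fraction of queries for which at least one true match appears among the top K retrieved items. The function name, argument layout, and toy data are assumptions for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top-k ranking.

    ranked_ids:   list of rankings, one list of candidate ids per query (best first).
    relevant_ids: list of sets, the true matches for each query.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if any of its top-k candidates is a true match.
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy example: two text queries, each ranked against three candidate images.
rankings = [["a", "b", "c"], ["d", "e", "f"]]
truth = [{"b"}, {"f"}]
for k in (1, 2, 3):
    print(k, recall_at_k(rankings, truth, k))
```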

Paper Structure (12 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: An example of the proposed benchmark, GeoText-1652. Here we show a text-guided drone geolocalization process. Left: Every image contains several region-level query sentences. Middle: Given the user description, we match the text and region of interest with the spatial relation. Right: With dense spatial relation matching, we can distinguish the place of interest from similar false positives and navigate the drone. It is worth noting that multiple buildings of similar appearance usually exist in neighbouring regions, so we also indicate the relative position, e.g., left, right, upper, and down, in the text query.
  • Figure 2: The properties of the proposed dataset GeoText-1652. Unlike traditional category annotation, our dataset includes not only detailed image-level descriptions but also short region-level descriptions (left). Samples of the dataset show that the descriptions align well with the image and its regions (right).
  • Figure 3: The proposed human-computer interaction annotation strategy. The strategy includes two main processes: the modality expansion annotator and the spatial refinement annotator. The modality expansion annotator produces the image-level and region-level descriptions. The spatial refinement annotator uses the region-level descriptions to perform visual grounding. Finally, after human-computer filtering, we build the proposed dataset of Image-Text-Bbox pairs.
  • Figure 4: (a) Comparison between the proposed GeoText-1652 dataset and other existing geolocalization datasets. The labels G, S, and D represent ground-view, satellite-view, and drone-view images, respectively. (b) Why is relative position necessary? Here we show some typical challenging cases. In these rows, similar objects (boats, towers, skyscrapers) are difficult to distinguish based on their characteristics alone. However, their spatial relationships (left, middle, right) can effectively aid in distinguishing them. Color-coded boxes highlight the main object from left to right.
  • Figure 5: The proposed multi-modal framework. The framework processes an aerial image by identifying regions of interest (ROIs) and matching them with corresponding text descriptions. It contains an image encoder that extracts visual embeddings and intermediate feature maps. Region-level visual features are obtained via ROI Pooling and concatenated, and a multi-layer perceptron (MLP) then predicts the spatial relation between the regions. In parallel, the text inputs, including the image-level and region-level descriptions, are encoded separately by the text encoder. Two attention modules, which share the same weights, integrate the image and text features. The framework applies several loss functions: Grounding and Spatial Loss for blending spatial matching, and ITM and ITC Loss for image-text matching.
  • ...and 2 more figures
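
The captions above describe distinguishing similar-looking regions by their relative position (left, middle, right, upper, down). As a rough illustration of how such a relation could be derived from two bounding boxes, the sketch below compares box centers. This is a hypothetical rule written for this summary, not the paper's labeling or training procedure; the `(x1, y1, x2, y2)` box format, the `tol` parameter, and the function names are all assumptions.

```python
def center(box):
    """Return the (x, y) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relative_position(box_a, box_b, tol=0.0):
    """Describe where box_a lies relative to box_b in image coordinates.

    Image coordinates are assumed: x grows to the right, y grows downward,
    so a smaller y means the box is higher in the image.
    """
    ax, ay = center(box_a)
    bx, by = center(box_b)
    dx, dy = ax - bx, ay - by
    # Offsets within tol of zero are treated as "middle" on that axis.
    horizontal = "left" if dx < -tol else "right" if dx > tol else "middle"
    vertical = "upper" if dy < -tol else "down" if dy > tol else "middle"
    return horizontal, vertical


# Two same-sized boxes on one row: the first lies to the left of the second.
print(relative_position((0, 0, 10, 10), (20, 0, 30, 10)))
```

A learned model such as the MLP described in Figure 5 would predict the relation from region features rather than from coordinates; a coordinate rule like this one is only a way to see what the target labels express.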