Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua
TL;DR
This work tackles natural language-guided drone geolocalization by addressing data scarcity and fine-grained language-vision alignment. It introduces GeoText-1652, a three-platform extension of University-1652 with image-text-bbox annotations totaling 276,045 text-bbox pairs and 316,335 descriptions, enabling two tasks: text-guided drone navigation and drone-view target localization. A cross-modal framework based on the XVLM backbone integrates image-text semantic matching with a novel blending spatial matching, formalized by the total loss $\mathcal{L}_{total}=\mathcal{L}_{itc}+\mathcal{L}_{itm}+\lambda(\mathcal{L}_{grounding}+\mathcal{L}_{spatial})$ where $\lambda=0.1$, to reason about region-level spatial relations. Experiments show improved Recall@K over baselines and demonstrate robustness to unseen scenes, indicating strong potential for accurate, language-driven drone control in real-world environments.
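As a rough illustration of how the four terms in the total loss are weighted, here is a minimal Python sketch. It assumes the component losses have already been computed as plain scalars; the function name `total_loss` and the example values are hypothetical, and only the formula and $\lambda=0.1$ come from the summary above.

```python
def total_loss(l_itc, l_itm, l_grounding, l_spatial, lam=0.1):
    """Combine the loss terms as L_itc + L_itm + lam * (L_grounding + L_spatial).

    All inputs are assumed to be precomputed scalar loss values;
    lam defaults to the 0.1 weight quoted in the summary.
    """
    return l_itc + l_itm + lam * (l_grounding + l_spatial)


# Hypothetical loss values, chosen only to show the weighting.
example = total_loss(l_itc=1.0, l_itm=1.0, l_grounding=2.0, l_spatial=2.0)
print(f"total loss: {example:.2f}")
```

With $\lambda=0.1$, the grounding and spatial terms contribute one tenth of their raw magnitude, so the two image-text matching terms dominate the objective unless the region-level losses are comparatively large.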
Abstract
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
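The recall rate referred to above is reported as Recall@K. The sketch below shows one common way to compute it for retrieval: a query counts as a hit if any of its true matches appears among the top K ranked candidates. This definition, the function names, and the toy data are assumptions for illustration; the exact evaluation protocol used for GeoText-1652 is not specified in this summary.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any of the top-k ranked ids is a true match, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0


def mean_recall_at_k(all_ranked, all_relevant, k):
    """Average the per-query hit indicator over all queries."""
    hits = [recall_at_k(r, rel, k) for r, rel in zip(all_ranked, all_relevant)]
    return sum(hits) / len(hits)


# Toy data (hypothetical): two text queries, each with a ranked list of image ids.
ranked = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"b"}, {"f"}]

for k in (1, 2, 3):
    print(f"Recall@{k} = {mean_recall_at_k(ranked, relevant, k):.2f}")
```

In this toy case the first query's match is ranked second and the second query's match is ranked third, so recall rises as K grows.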
