Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation -- Technical Report for IROS 2025 RoboSense Challenge Track 4

Lingfeng Zhang; Erjia Xiao; Yuchen Zhang; Haoxiang Fu; Ruibin Hu; Yanbiao Ma; Wenbo Ding; Long Chen; Hangjun Ye; Xiaoshuai Hao

TL;DR

This work tackles natural language-guided cross-modal drone navigation by introducing a two-stage Caption-Guided Retrieval System (CGRS). It first uses a GeoText-1652 baseline to perform coarse retrieval of the top $20$ candidates. It then uses a Vision-Language Model to generate detailed captions for these images and reranks the candidates by the text-to-text similarity between the user query and each caption, combined with the coarse score. The approach yields consistent improvements on the Recall metrics ($R@1$, $R@5$, $R@10$) and secured a Top-2 finish in the RoboSense 2025 Track 4 challenge, demonstrating the effectiveness of caption-driven semantic refinement in real-world drone navigation. The findings suggest that converting visual content into semantically rich captions provides a robust bridge between language and complex aerial imagery, enabling finer-grained matching in spatially structured scenes.
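The reranking step described above can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the report: the bag-of-words cosine in `bow_cosine` is a toy stand-in for whatever text similarity model the system uses, the linear fusion weight `alpha` is hypothetical, coarse scores are assumed to lie in $[0, 1]$, and the captions are assumed to have been produced already by the Vision-Language Model.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words counts (toy stand-in for a text encoder)."""
    ta = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    tb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank(query, candidates, captions, alpha=0.5, top_k=20):
    """Rerank coarse candidates with caption-to-query text similarity.

    candidates: list of (image_id, coarse_score) pairs from the baseline.
    captions:   dict mapping image_id to a generated caption (assumed given).
    alpha:      hypothetical weight on the coarse score in the fused score.
    """
    # Stage 1: keep only the top_k candidates by coarse score.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

    # Stage 2: fuse the coarse score with the query-to-caption similarity.
    fused = []
    for image_id, coarse in shortlist:
        text_sim = bow_cosine(query, captions.get(image_id, ""))
        fused.append((image_id, alpha * coarse + (1 - alpha) * text_sim))
    return sorted(fused, key=lambda c: c[1], reverse=True)
```

In this sketch a candidate that the baseline ranks slightly lower can move to the top when its caption matches the query much more closely, which is the behavior the two-stage design aims for.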

Abstract

Cross-modal drone navigation remains a challenging task in robotics, requiring efficient retrieval of relevant images from large-scale databases based on natural language descriptions. The RoboSense 2025 Track 4 challenge targets this problem, focusing on robust, natural language-guided cross-view image retrieval across multiple platforms (drones, satellites, and ground cameras). Current baseline methods, while effective for initial retrieval, often struggle to achieve fine-grained semantic matching between text queries and visual content, especially in complex aerial scenes. To close this gap, we propose a two-stage retrieval refinement method, the Caption-Guided Retrieval System (CGRS), which enhances the baseline coarse ranking through reranking. Our method first uses a baseline model to obtain an initial coarse ranking of the top 20 most relevant images for each query. We then use a Vision-Language Model (VLM) to generate detailed captions for these candidate images, capturing rich semantic descriptions of their visual content. The generated captions are then used in a multimodal similarity computation framework to perform fine-grained reranking of the candidates against the original text query, effectively building a semantic bridge between the visual content and natural language descriptions. Our approach improves upon the baseline by a consistent 5\% across all key metrics (Recall@1, Recall@5, and Recall@10). It placed in the top 2 of the challenge, demonstrating the practical value of our semantic refinement strategy in real-world robotic navigation scenarios.
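For readers unfamiliar with the reported metrics, the following is a minimal sketch of Recall@K as it is commonly defined in retrieval work: the fraction of queries for which at least one correct image appears among the top K results. The challenge's exact evaluation protocol is not stated in this section, so this definition and the function name `recall_at_k` are assumptions for illustration only.

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k):
    """Fraction of queries with at least one relevant image in the top k.

    ranked_ids_per_query:   list of ranked image-id lists, one per query.
    relevant_ids_per_query: list of sets of correct image ids, one per query.
    """
    if not ranked_ids_per_query:
        return 0.0
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        # A query counts as a hit if any of its top-k results is relevant.
        if any(image_id in relevant for image_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)
```

Under this definition, reranking within a fixed top-20 shortlist can raise Recall@1, Recall@5, and Recall@10, but it cannot recover a correct image that the coarse stage left out of the shortlist.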
