Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation -- Technical Report for IROS 2025 RoboSense Challenge Track 4

Lingfeng Zhang, Erjia Xiao, Yuchen Zhang, Haoxiang Fu, Ruibin Hu, Yanbiao Ma, Wenbo Ding, Long Chen, Hangjun Ye, Xiaoshuai Hao

TL;DR

This work tackles natural language-guided cross-modal drone navigation by introducing a two-stage Caption-Guided Retrieval System (CGRS). It first uses a GeoText-1652 baseline to perform coarse retrieval of the top $20$ candidates, then uses a Vision-Language Model to generate detailed captions for these images and reranks the candidates by text-to-text similarity between the user query and the captions, combined with the coarse score. The approach yields consistent improvements on the Recall metrics ($R@1$, $R@5$, $R@10$) and secured a Top-2 finish in the RoboSense 2025 Track 4 challenge, demonstrating the effectiveness of caption-driven semantic refinement in real-world drone navigation. The findings suggest that converting visual content into semantically rich captions provides a robust bridge between language and complex aerial imagery, enabling finer-grained matching in spatially structured scenes.
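
As a rough illustration of the second stage, the sketch below reranks a handful of coarse candidates by fusing each coarse score with a query-to-caption text similarity. Everything specific here is an assumption rather than the authors' implementation: the bag-of-words cosine similarity stands in for whatever text encoder is actually used, the captions are hand-written placeholders for VLM output, and the fusion weight `alpha` and the linear fusion rule are hypothetical.

```python
import math
from collections import Counter


def text_similarity(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (assumed stand-in for a text encoder)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query, candidates, alpha=0.5):
    """Rerank coarse candidates by a weighted sum of coarse and caption scores.

    candidates: list of (image_id, coarse_score, caption) tuples.
    alpha: hypothetical weight on the coarse score (not taken from the paper).
    """
    fused = []
    for image_id, coarse_score, caption in candidates:
        fine_score = text_similarity(query, caption)
        fused.append((image_id, alpha * coarse_score + (1 - alpha) * fine_score))
    # Highest fused score first.
    return sorted(fused, key=lambda item: item[1], reverse=True)


# Toy data: three coarse candidates with placeholder captions.
query = "a red brick building next to a parking lot"
candidates = [
    ("img_a", 0.62, "a grey office tower beside a river"),
    ("img_b", 0.60, "a red brick building next to a large parking lot"),
    ("img_c", 0.55, "a sports field surrounded by trees"),
]
ranked = rerank(query, candidates)
print([image_id for image_id, _ in ranked])
```

In this toy case the coarse stage ranks `img_a` first, but its caption shares little with the query, so the fused score moves `img_b` to the top. In practice the two scores may live on different scales and would likely need normalization before fusion.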

Abstract

Cross-modal drone navigation remains a challenging task in robotics, requiring efficient retrieval of relevant images from large-scale databases based on natural language descriptions. The RoboSense 2025 Track 4 challenge targets this problem, focusing on robust, natural language-guided cross-view image retrieval across multiple platforms (drones, satellites, and ground cameras). Current baseline methods, while effective for initial retrieval, often struggle to achieve fine-grained semantic matching between text queries and visual content, especially in complex aerial scenes. To close this gap, we propose a two-stage retrieval refinement method, the Caption-Guided Retrieval System (CGRS), which improves the baseline coarse ranking through reranking. Our method first uses a baseline model to obtain an initial coarse ranking of the top 20 most relevant images for each query. We then use a Vision-Language Model (VLM) to generate detailed captions for these candidate images, capturing rich semantic descriptions of their visual content. The generated captions are then used in a multimodal similarity computation framework to rerank the candidates against the original text query at a fine-grained level, effectively building a semantic bridge between the visual content and natural language descriptions. Our approach significantly improves upon the baseline, achieving a consistent 5% improvement across all key metrics (Recall@1, Recall@5, and Recall@10). Our approach placed in the top 2 of the challenge, demonstrating the practical value of our semantic refinement strategy in real-world robotic navigation scenarios.
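
For readers unfamiliar with the metrics, the following is a minimal sketch of how Recall@K is commonly computed for retrieval: a query counts as a hit if any ground-truth image appears among its top K results, and the metric is the fraction of queries that hit. This is an assumed, conventional definition; the challenge's official evaluation script may differ in detail, and the function names and toy data are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant image is in the top-k results, else 0.0."""
    return 1.0 if any(i in relevant_ids for i in ranked_ids[:k]) else 0.0


def mean_recall_at_k(results, k):
    """Average the per-query hit indicator over all queries.

    results: list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in results) / len(results)


# Toy data: three queries, each with a ranked list and a ground-truth set.
results = [
    (["a", "b", "c"], {"a"}),  # hit at rank 1
    (["d", "e", "f"], {"f"}),  # hit at rank 3
    (["g", "h", "i"], {"z"}),  # no hit
]
r_at_1 = mean_recall_at_k(results, 1)
r_at_3 = mean_recall_at_k(results, 3)
print(r_at_1, r_at_3)
```

Because the second stage only reorders the 20 candidates returned by the coarse stage, it can change Recall@1, Recall@5, and Recall@10 but not Recall@20: an image missed by the coarse stage cannot be recovered by reranking.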

Paper Structure

This paper contains 15 sections, 12 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of our Caption-Guided Retrieval System (CGRS). Our framework employs a two-stage pipeline: the coarse-grained model first retrieves the top 20 candidate images from the gallery using the GeoText-1652 baseline. The fine-grained model then generates detailed captions for these candidates using a Vision-Language Model (VLM) and performs semantic reranking based on text-to-text similarity between the query and generated captions, producing the final top 20 results.
  • Figure 2: Qualitative Results.