BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

Tao Tang, Dafeng Wei, Zhengyu Jia, Tian Gao, Changwei Cai, Chengkai Hou, Peng Jia, Kun Zhan, Haiyang Sun, Jingchen Fan, Yixing Zhao, Fu Liu, Xiaodan Liang, Xianpeng Lang, Yang Wang

TL;DR

BEV-TSR tackles text-to-scene retrieval in autonomous driving by mapping both vision and language into a BEV-centric space that captures global scene context. It combines a BEV encoder for scenes, an LLM-based text encoder augmented with knowledge-graph embeddings, and a Shared Cross-modal Embedding to align modalities, aided by a caption-generation auxiliary task. The authors introduce nuScenes-Retrieval, a multi-level dataset built on nuScenes, and report state-of-the-art results with top-1 recalls of $85.78\%$ (scene-to-text) and $87.66\%$ (text-to-scene), supported by extensive ablations validating each component. The approach improves retrieval of complex driving scenes and provides a stronger foundation for data-driven optimization in autonomous driving, with potential extensions to multi-sensor data and pre-training paradigms.

Abstract

The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there is a growing demand for retrieving data to support specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of global feature representation and inadequate text retrieval ability for complex driving scenes. To address these issues, we first propose the BEV-TSR framework, which leverages descriptive text as input to retrieve corresponding scenes in the Bird's Eye View (BEV) space. Then, to facilitate complex scene retrieval with extensive text descriptions, we employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding. To achieve feature alignment between the BEV feature and the language embedding, we propose Shared Cross-modal Embedding, which uses a set of shared learnable embeddings to bridge the gap between the two modalities, and employ a caption generation task to further enhance the alignment. Furthermore, there is a lack of well-formed retrieval datasets for effective evaluation. To this end, we establish a multi-level retrieval dataset, nuScenes-Retrieval, based on the widely adopted nuScenes dataset. Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-text and text-to-scene retrieval, respectively. Code and datasets will be made available.
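
To make the reported metric concrete, the following is a minimal sketch of how top-1 accuracy could be computed in both retrieval directions from paired embeddings. It is an illustration only, not the authors' evaluation code: the function name `top1_accuracy`, the use of cosine similarity, the synthetic data, and the assumption that scene `i` has exactly one matching text `i` are all hypothetical (the multi-level nuScenes-Retrieval dataset may pair scenes and texts differently).

```python
import numpy as np

def top1_accuracy(query_emb, gallery_emb):
    """Fraction of queries whose most similar gallery item has the same index.

    Assumes (for illustration) that query i and gallery item i form the
    only correct pair, and that similarity is cosine similarity.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                      # (num_queries, num_gallery) cosine scores
    predicted = sim.argmax(axis=1)     # index of the best match for each query
    return float((predicted == np.arange(len(q))).mean())

# Synthetic stand-ins for scene (BEV) and text embeddings; not real data.
rng = np.random.default_rng(0)
scene_emb = rng.normal(size=(8, 16))
text_emb = scene_emb + 0.01 * rng.normal(size=(8, 16))  # near-perfect alignment

scene_to_text = top1_accuracy(scene_emb, text_emb)  # scenes query texts
text_to_scene = top1_accuracy(text_emb, scene_emb)  # texts query scenes
```

The two directions reuse one similarity computation with the roles of query and gallery swapped, which is why the two scores can differ on real data.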

Paper Structure (24 sections, 5 equations, 7 figures, 4 tables)

This paper contains 24 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) Existing methods are primarily tailored for simple retrieval scenarios. (b) In contrast, autonomous driving scenarios are challenging, with numerous traffic participants and road elements. The BEV space offers a clearer global context of the scene than the image space, aligns well with the textual query, and serves as an ideal retrieval space. (c) To this end, we propose the novel BEV-TSR framework for text-scene retrieval in autonomous driving, which retrieves scenes in BEV space and demonstrates a significant capability to retrieve traffic scenarios.
  • Figure 2: Overall framework of BEV-TSR. (a) Feature Extraction. For the visual branch, the BEV encoder extracts the BEV embedding from surrounding camera images. For the textual branch, the text embedding is enriched by incorporating the knowledge graph embedding and then fed into a language encoder to generate the language embedding. (b) Feature Alignment. First, a set of shared learnable embeddings is employed to bridge the gap between the two branches' features. Moreover, a caption generation auxiliary task further enhances the alignment. Then, the resulting features are aligned with the contrastive loss.
  • Figure 3: Knowledge graph prompting. (a) The knowledge graph embeddings are learned from the autonomous driving knowledge graph. Each node in the graph corresponds to a keyword relevant to autonomous driving, and the embeddings associated with these nodes capture the associative representation of autonomous driving keywords. (b) Subsequently, these keyword knowledge graph embeddings are concatenated with the text embedding, thereby expanding the semantic representation of the encoded text, and the result is then encoded by a language encoder (zoom in for a better view).
  • Figure 4: Detailed architecture of the Shared Cross-modal Embedding (SCE).
  • Figure 5: (a) The number of text descriptions. (b) An example from the nuScenes-Retrieval dataset.
  • ...and 2 more figures
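
The feature-alignment step described in the Figure 2 caption (shared learnable embeddings followed by a contrastive loss) can be sketched roughly as below. This is one plausible reading under stated assumptions, not the SCE architecture from the paper: here each modality's features attend over a single shared embedding bank and are re-expressed as weighted mixes of it, and the loss is a symmetric InfoNCE-style loss. The names `project_onto_shared` and `symmetric_contrastive_loss`, the bank size, and the temperature of 0.07 are hypothetical, and the shared bank is random rather than learned.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def project_onto_shared(features, shared):
    """Assumed mechanism: attention of each feature over the shared bank."""
    scores = features @ shared.T / np.sqrt(shared.shape[1])
    return softmax(scores, axis=1) @ shared    # (N, D) mixes of shared embeddings

def symmetric_contrastive_loss(bev, text, temperature=0.07):
    """Cross-entropy over similarities in both directions, pair i <-> i."""
    bev = bev / np.linalg.norm(bev, axis=1, keepdims=True)
    text = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = bev @ text.T / temperature
    idx = np.arange(len(bev))
    scene_to_text = log_softmax(logits, axis=1)[idx, idx]  # normalize over texts
    text_to_scene = log_softmax(logits, axis=0)[idx, idx]  # normalize over scenes
    return float(-(scene_to_text.mean() + text_to_scene.mean()) / 2)

# Random stand-ins for encoder outputs and the shared bank (hypothetical sizes).
rng = np.random.default_rng(0)
shared_bank = rng.normal(size=(4, 16))
bev_feat = rng.normal(size=(8, 16))
text_feat = rng.normal(size=(8, 16))

bev_aligned = project_onto_shared(bev_feat, shared_bank)
text_aligned = project_onto_shared(text_feat, shared_bank)
loss = symmetric_contrastive_loss(bev_aligned, text_aligned)
```

Because both modalities are expressed in terms of the same bank, their outputs live in a common subspace before the contrastive loss is applied; in a trained model the bank and encoders would be updated jointly by gradient descent, which this NumPy sketch omits.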