Spatial Retrieval Augmented Autonomous Driving

Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang

TL;DR

The paper introduces spatial retrieval to augment autonomous driving with offline geographic imagery, addressing perception horizon limits and occlusion. It presents nuScenes-Geography, a dataset extension using Google Maps data, and a plug-and-play Spatial Retrieval Adapter with a Reliability Estimation gate to fuse geography into BEV-based tasks. Across object detection, online mapping, occupancy, planning, and generative world modeling, geographic priors improve performance and temporal consistency, especially in challenging conditions, while remaining robust to incomplete retrieval. The work provides open-source data, pipelines, and baselines to promote retrieval-augmented autonomous driving research.

Abstract

Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc.) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under a limited field of view, occlusion, or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, which introduces offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making the paradigm a plug-and-play extension for existing AD tasks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.

Paper Structure

This paper contains 100 sections, 30 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Existing autonomous driving systems (upper) largely rely on onboard sensors, which are vulnerable to perceptual conditions. Inspired by human drivers’ ability to recall memory of previously seen roads, we introduce the spatial retrieval paradigm (lower) which utilizes offline cached geographic images as an extra input modality to enhance the performance of multiple AD tasks.
  • Figure 2: Spatial Retrieval Adapter. We adopt cross-attention with standard BEV features ($\mathbf{F}_{\text{BEV}}$) as the query and the retrieved geographic features ($\mathbf{F}_{\text{geo}}$) plus their corresponding 3D positional encodings ($\mathbf{P}_{\text{geo}}$) as key and value. For the generative world modeling task, we use a similar cross-attention architecture, with the noised latents as the query. Further, since the whole driving trajectory is known for video diffusion, we retrieve the corresponding geographic images based on the start and end frame positions so that the background becomes consistent.
  • Figure 3: Reliability Estimation Gate. When the retrieved geographic data is missing or misaligned, the gate drives the residual update toward zero, based on the difference between pose and image features.
  • Figure 4: Geographic Data Curation from Google Maps. We use the GPS coordinates from nuScenes ego poses to query Google Maps. Each unique panorama is downloaded once, decomposed into 18 yaw-sampled views, and projected onto an equirectangular panorama representation. For each camera at each frame in nuScenes, a virtual camera with matched intrinsics/extrinsics reprojects an aligned street view image from the equirectangular panorama, which reduces redundant downloads and storage.
  • Figure 5: Correspondence Relation between Frames and Geographic Data. nuScenes frames exhibit a higher acquisition frequency than street view data along the same road. Each frame is matched to its geographically nearest street view data, where multiple frames may correspond to the same street view data.
  • ...and 17 more figures
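
The frame-to-street-view correspondence described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`haversine_m`, `match_frames_to_panoramas`), the dictionary layout, and the coordinates are hypothetical, and great-circle distance is only one plausible choice for "geographically nearest".

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def match_frames_to_panoramas(frames, panoramas):
    """Map each frame id to the id of its nearest panorama.

    frames and panoramas are dicts of id -> (lat, lon). Several frames may map
    to the same panorama, since frames are sampled more densely along the road.
    """
    matches = {}
    for frame_id, (lat, lon) in frames.items():
        matches[frame_id] = min(
            panoramas,
            key=lambda pid: haversine_m(lat, lon, *panoramas[pid]),
        )
    return matches


# Hypothetical coordinates, for illustration only.
frames = {
    "f0": (1.30000, 103.80000),
    "f1": (1.30001, 103.80000),  # about 1 m north of f0
    "f2": (1.30010, 103.80000),  # about 11 m north of f0
}
panoramas = {
    "p0": (1.30000, 103.80000),
    "p1": (1.30010, 103.80000),
}
matches = match_frames_to_panoramas(frames, panoramas)
```

Here `f0` and `f1` both resolve to `p0`, which mirrors the many-to-one relation in the caption. A brute-force search is enough for a sketch; a spatial index would be the natural replacement at dataset scale.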