Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention

Xunjiang Gu, Guanyu Song, Igor Gilitschenski, Marco Pavone, Boris Ivanovic

TL;DR

This paper addresses the inefficiency and information loss of online HD map estimation by feeding the mapping model's internal BEV features directly to trajectory prediction. It introduces three BEV-based strategies: (1) agent–BEV attention to model agent-lane interactions, (2) augmenting estimated lanes with BEV features, and (3) replacing agent information with temporal BEV features. Across multiple mapping and prediction models evaluated on nuScenes, the approach yields up to 73% faster inference and up to 29% more accurate predictions, with ablations identifying the best-performing BEV patch sizes and showing the value of temporal BEV information. The results indicate that exploiting BEV features from online map estimation can reduce computation in end-to-end autonomous driving pipelines without sacrificing, and often while improving, predictive performance. Potential limitations include reliance on black-box BEV representations; better interpretability and co-training of the mapping and prediction components remain open directions.

Abstract

Understanding road geometry is a critical component of the autonomous vehicle (AV) stack. While high-definition (HD) maps can readily provide such information, they suffer from high labeling and maintenance costs. Accordingly, many recent works have proposed methods for estimating HD maps online from sensor data. The vast majority of recent approaches encode multi-camera observations into an intermediate representation, e.g., a bird's eye view (BEV) grid, and produce vector map elements via a decoder. While this architecture is performant, it decimates much of the information encoded in the intermediate representation, preventing downstream tasks (e.g., behavior prediction) from leveraging it. In this work, we propose exposing the rich internal features of online map estimation methods and show how they enable more tightly integrating online mapping with trajectory forecasting. In doing so, we find that directly accessing internal BEV features yields up to 73% faster inference speeds and up to 29% more accurate predictions on the real-world nuScenes dataset.
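To make the first strategy (agent–BEV attention over a local region of the BEV grid) more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the agent's grid cell is already known, uses a single attention head, and stands in random matrices for the learned projections; the names `crop_bev_patch`, `agent_bev_attention`, and all tensor sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def crop_bev_patch(bev, center_rc, patch_size):
    """Crop a (patch_size, patch_size, C) window of `bev` centred on `center_rc`.

    Cells that fall outside the grid are zero-padded. `patch_size` is assumed
    to be odd so that the agent's cell sits exactly in the middle of the patch.
    """
    half = patch_size // 2
    padded = np.pad(bev, ((half, half), (half, half), (0, 0)))
    r, c = center_rc
    # After padding, original cell (r, c) lives at (r + half, c + half),
    # so the window centred on it starts at padded index (r, c).
    return padded[r:r + patch_size, c:c + patch_size]


def agent_bev_attention(agent_feat, patch, w_q, w_k, w_v):
    """Single-head scaled dot-product attention of one agent over patch cells.

    Returns the attended context vector and the attention weights per cell.
    """
    cells = patch.reshape(-1, patch.shape[-1])   # (P*P, C)
    q = agent_feat @ w_q                         # (D,)
    k = cells @ w_k                              # (P*P, D)
    v = cells @ w_v                              # (P*P, D)
    scores = k @ q / np.sqrt(q.shape[0])         # (P*P,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ v, weights


# Illustrative sizes only (hypothetical, not taken from the paper).
rng = np.random.default_rng(0)
H, W, C, D_AGENT, D = 100, 50, 16, 8, 12
bev = rng.normal(size=(H, W, C))          # stand-in for the mapper's BEV features
agent_feat = rng.normal(size=D_AGENT)     # stand-in for an encoded agent history
w_q = rng.normal(size=(D_AGENT, D))       # random stand-ins for learned projections
w_k = rng.normal(size=(C, D))
w_v = rng.normal(size=(C, D))

patch = crop_bev_patch(bev, center_rc=(10, 20), patch_size=5)
context, weights = agent_bev_attention(agent_feat, patch, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

In this sketch the patch size controls how much of the scene each agent can attend to, which is the quantity the ablations on BEV patch size vary. A real model would batch this over agents and learn the projections jointly with the predictor.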

Paper Structure (20 sections, 3 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Online map estimation approaches predominantly encode multi-camera observations into a canonical BEV feature grid prior to decoding vectorized map elements. In this work, we propose deeply integrating online mapping with downstream tasks through direct access to the rich BEV features of online map estimation methods.
  • Figure 2: Three different strategies for incorporating BEV features in behavior prediction. Left: local region attention to encode agent-map interaction; Middle: augmenting lane vertices with BEV features; Right: replacing agent trajectories with temporal BEV features.
  • Figure 3: Our integrated BEV-prediction approach runs faster than decoupled baselines across all scenario sizes (number of agents and map elements) and mapping models.
  • Figure 4: StreamMapNet [yuan2024streammapnet] and HiVT [zhou2022hivt] combined using the strategy in \ref{sec:replace_attend}. By replacing lane information with temporal BEV features, HiVT is able to keep its predicted trajectories in the current lane, closely aligning with the GT trajectory.
  • Figure 5: MapTR [MapTR] and DenseTNT [GuSunEtAl2021] combined via the strategy in \ref{sec:strategy_2}. Our augmentation of map vertices with BEV features enables DenseTNT to produce very accurate trajectories, preventing the road boundary incursions seen in the Baseline and Uncertainty-enhanced [GuSongEtAl2024] setups.
  • ...and 5 more figures