BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals

Minsang Kong, Myeongjun Kim, Sang Gu Kang, Hejiu Lu, Yupeng Zhong, Sang Hun Lee

TL;DR

BEVTraj (Bird's-Eye View Trajectory Prediction) is a map-free framework that uses deformable attention to adaptively aggregate task-relevant context from sparse locations in dense BEV features. Together with a Sparse Goal Candidate Proposal (SGCP) module, this enables fully end-to-end multimodal forecasting without heuristic post-processing.

Abstract

In autonomous driving, trajectory prediction is essential for safe and efficient navigation. While recent methods often rely on high-definition (HD) maps to provide structured environmental priors, such maps are costly to maintain, geographically limited, and unreliable in dynamic or unmapped scenarios. Directly leveraging raw sensor data in Bird's-Eye View (BEV) space offers greater flexibility, but BEV features are dense and unstructured, making agent-centric spatial reasoning challenging and computationally inefficient. To address this, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a map-free framework that employs deformable attention to adaptively aggregate task-relevant context from sparse locations in dense BEV features. We further introduce a Sparse Goal Candidate Proposal (SGCP) module that predicts a small set of realistic goals, enabling fully end-to-end multimodal forecasting without heuristic post-processing. Extensive experiments show that BEVTraj achieves performance comparable to state-of-the-art HD map-based methods while providing greater robustness and flexibility without relying on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
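
The idea of aggregating context from sparse locations in a dense BEV feature map can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`bilinear_sample`, `deformable_aggregate`), the toy grid, and the hand-picked offsets and scores are all assumptions made for illustration. In a deformable attention layer, the offsets and attention scores would normally be predicted from the query by learned projections; here they are passed in directly so the sampling and weighting steps stay visible.

```python
import math


def bilinear_sample(bev, x, y):
    """Bilinearly interpolate a feature vector at continuous grid position (x, y).

    `bev` is a nested list indexed as bev[row][col][channel].
    Points outside the grid are clamped to the border (an assumption of this sketch).
    """
    h, w = len(bev), len(bev[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(bev[0][0])
    return [
        (1 - fx) * (1 - fy) * bev[y0][x0][c]
        + fx * (1 - fy) * bev[y0][x1][c]
        + (1 - fx) * fy * bev[y1][x0][c]
        + fx * fy * bev[y1][x1][c]
        for c in range(channels)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(bev, reference, offsets, scores):
    """Weighted sum of features sampled at a few offsets around a reference point.

    Only len(offsets) locations are read, regardless of the grid size.
    """
    weights = softmax(scores)
    channels = len(bev[0][0])
    out = [0.0] * channels
    for (dx, dy), wgt in zip(offsets, weights):
        feat = bilinear_sample(bev, reference[0] + dx, reference[1] + dy)
        for c in range(channels):
            out[c] += wgt * feat[c]
    return out


# Toy 4x4 BEV grid with 2 channels: channel 0 stores the column index,
# channel 1 stores the row index (hypothetical data, not real sensor features).
bev = [[[float(col), float(row)] for col in range(4)] for row in range(4)]

# Hypothetical offsets and scores; a trained model would predict these per query.
offsets = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
scores = [0.0, 0.0, 0.0, 0.0]  # equal scores give uniform attention weights

result = deformable_aggregate(bev, reference=(1.5, 1.5), offsets=offsets, scores=scores)
print(result)
```

The cost of one query depends on the number of sampled points rather than on the BEV grid resolution, which is the property that makes this style of attention attractive for dense, unstructured feature maps.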

Paper Structure

This paper contains 49 sections, 4 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Two possible approaches for trajectory prediction in the absence of a pre-defined HD map. The first approach (top) constructs an HD map in real time and applies conventional HD map-based prediction methods. The second approach (bottom), proposed in this study as BEVTraj, directly predicts trajectories by leveraging BEV features extracted from raw sensor data.
  • Figure 2: Overall architecture of BEVTraj. The Sensor Encoder processes multimodal sensor data (e.g., camera images, LiDAR point clouds) to generate BEV features, while the Pre-Encoder captures agent motion history. The BEV Deformable Aggregation (BDA) module efficiently compresses the BEV features into a compact representation, which is then integrated with the Pre-Encoder's output through local self-attention. The Iterative Deformable Decoder predicts the target agent's trajectory and iteratively refines it using both the BEV features and the scene context features.
  • Figure 3: Architecture of the BEV Deformable Aggregation (BDA) module. The BA queries and learnable reference positions are iteratively refined through self-attention and deformable cross-attention layers. The final BA queries are passed through an MLP to produce BEV aggregated features.
  • Figure 4: Structure of the Iterative Deformable Decoder, consisting of three sub-modules for multimodal trajectory prediction: goal proposal, initial prediction, and iterative refinement. Deformable attention is used in each stage to process BEV features.
  • Figure 5: Comparison of goal candidate proposal methods. (a) DenseTNT generates dense goal candidates along lanes. (b) MTR defines intention points via k-means clustering. (c) Our SGCP module predicts a sparse set of goal candidates conditioned on the agent’s dynamic state and the BEV feature map.
  • ...and 9 more figures