DeTra: A Unified Model for Object Detection and Trajectory Forecasting

Sergio Casas, Ben Agro, Jiageng Mao, Thomas Gilles, Alexander Cui, Thomas Li, Raquel Urtasun

TL;DR

DeTra tackles the challenge of jointly detecting and forecasting road agents without cascading errors by reframing detection and forecasting as a trajectory refinement problem in BEV. It introduces a refinement transformer that maintains a volume of object queries and poses across time and behavioral modes, refining them through cross-attention to LiDAR and HD-map tokens and a factorized self-attention scheme. Across AV2 and Waymo Open Dataset, DeTra achieves state-of-the-art results on detection, forecasting, and joint metrics, with notable gains in OccAP and TrajAP and robust ablations confirming the contribution of pose refinement and attention design. The approach offers a practical, end-to-end framework for perception and forecasting in autonomous driving, with potential for anytime predictions and improved integration with planning.

Abstract

The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates the union of the two tasks as a trajectory refinement problem, where the first pose is the detection (current time), and the subsequent poses are the waypoints of the multiple forecasts (future time). To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects directly from LiDAR point clouds and high-definition maps. We call this model DeTra, short for object Detection and Trajectory forecasting. In our experiments, we observe that DeTra outperforms the state-of-the-art on Argoverse 2 Sensor and Waymo Open Dataset by a large margin, across a broad range of metrics. Last but not least, we perform extensive ablation studies that show the value of refinement for this task, that every proposed component contributes positively to its performance, and that key design choices were made.
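
To make the unified formulation concrete, here is a minimal sketch of how one object's refined poses could be split into a detection and its multi-modal forecasts. The layout (modes by time steps by (x, y)), the choice to read the detection from mode 0, and the names `Pose`, `split_trajectories`, and `poses` are assumptions for illustration, not the authors' code.

```python
from typing import List, Tuple

Pose = Tuple[float, float]  # (x, y) in BEV; hypothetical pose format


def split_trajectories(poses: List[List[Pose]]) -> Tuple[Pose, List[List[Pose]]]:
    """Split one object's refined poses into a detection and forecasts.

    poses[f][t] is the pose of mode f at time step t, with t = 0 the
    current time. Assumption: the detection is read from mode 0 at t = 0.
    """
    detection = poses[0][0]
    # Every later time step of every mode is a forecast waypoint.
    forecasts = [mode[1:] for mode in poses]
    return detection, forecasts


# Two hypothetical modes (going straight, turning) over three time steps.
poses = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.5), (1.5, 1.5)],
]
detection, forecasts = split_trajectories(poses)
```

In this view, detection and forecasting differ only in which time index of the same trajectory is read out, which is what allows a single refinement model to handle both.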

Paper Structure (55 sections, 1 equation, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Modular detection-tracking-forecasting methods use narrow interfaces, hindering their performance due to compounding errors and a lossy information bottleneck. Prior end-to-end methods widen the interfaces and train jointly but still utilize a cascading structure, thus suffering from compounding errors. Our method tackles the problem as a unified trajectory refinement task.
  • Figure 2: DeTra uses object queries and poses to represent objects' trajectories from the present time into the future. DeTra refines the initial estimates of the object queries and poses over $B$ blocks. Each refinement transformer block consists of several attention layers followed by a pose update. The set of poses at the end of the $B$-th block are the final detections and forecasts.
  • Figure 3: A diagram of the attention mechanisms used in DeTra. We factorize self-attention in our learnable query volume into object, mode, and time axes. For efficiency and ease of optimization, we perform deformable attention to LiDAR and $k$-nearest neighbor attention to the map.
  • Figure 4: Qualitative results on AV2. We highlight failure modes in the baselines that DeTra improves: inaccurate detections, off-map predictions, FN modes and FP modes.
  • Figure 5: Visualizing DeTra's self-improvement over refinement blocks on AV2. We highlight cases of improved forecasting accuracy and improved detection accuracy.
  • ...and 3 more figures
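
The refinement loop and factorized self-attention described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a toy NumPy illustration under several assumptions: the query volume has shape (objects, modes, time steps, channels), attention is single-head with no learned projections, the axes are visited in the order object, mode, time, cross-attention to LiDAR and map tokens is omitted, and the pose update is a residual offset from a random linear map standing in for a learned head. All function and variable names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_self_attention(q, axis):
    """Single-head self-attention along one axis of an (N, F, T, D) volume.

    No learned projections: queries, keys, and values are all `q`.
    """
    x = np.moveaxis(q, axis, -2)                                 # (..., L, D)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])   # (..., L, L)
    out = softmax(scores, axis=-1) @ x                           # (..., L, D)
    return np.moveaxis(out, -2, axis)


def refine(queries, poses, w_pose, num_blocks):
    """Run `num_blocks` refinement blocks; return the poses after each block."""
    history = []
    for _ in range(num_blocks):
        for axis in (0, 1, 2):  # object, mode, and time axes in turn
            queries = queries + axis_self_attention(queries, axis)  # residual
        poses = poses + queries @ w_pose  # pose update as a residual offset
        history.append(poses)
    return history


rng = np.random.default_rng(0)
N, F, T, D, B = 4, 3, 5, 8, 2  # objects, modes, time steps, channels, blocks
queries = rng.normal(size=(N, F, T, D))
poses = np.zeros((N, F, T, 2))
w_pose = rng.normal(size=(D, 2)) * 0.1
history = refine(queries, poses, w_pose, B)
```

Attending along one axis at a time means each attention matrix is only as large as that axis is long, rather than covering all objects, modes, and time steps jointly. Keeping the poses from every block also shows how intermediate outputs could be read out early, in the spirit of the anytime predictions mentioned in the TL;DR.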