StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction

Jiaheng Zhuang, Guoan Wang, Siyu Zhang, Xiyang Wang, Hangning Zhou, Ziyao Xu, Chi Zhang, Zhiheng Li

TL;DR

StreamMOTP addresses the problem of jointly performing 3D MOT and trajectory prediction in autonomous driving by introducing a streaming framework that propagates memory, features, and gradients across frames. It couples a memory bank for long-term object features with a relative Spatio-Temporal Positional Encoding to unify tracking and prediction representations, and employs a dual-stream predictor for temporally coherent, multi-modal futures. The MOT head uses differentiable optimal transport (log-Sinkhorn) for robust association, while a Gaussian Mixture Model decoder yields diverse trajectory predictions. Empirical results on nuScenes show state-of-the-art improvements in AMOTA, MOTA, and multi-step prediction metrics, underscoring better occlusion handling and temporal consistency for real-world deployments.
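The log-Sinkhorn association mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a square score matrix with uniform marginals and no extra "unmatched" row or column, and the names `log_sinkhorn`, `scores`, and `n_iters` are hypothetical. The idea is to alternate row and column normalisation in log space so that tracklet-to-detection affinities become a soft, approximately doubly stochastic assignment while every step stays differentiable.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sinkhorn(scores, n_iters=100):
    """Log-domain Sinkhorn normalisation of a square score matrix.

    scores[i][j] is a hypothetical affinity between tracklet i and
    detection j. Returns a soft assignment matrix whose rows and
    columns each sum to approximately 1.
    """
    n = len(scores)
    log_p = [list(row) for row in scores]  # work on a copy
    for _ in range(n_iters):
        # Row step: subtract each row's log-sum-exp so the row sums to 1.
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column step: the same normalisation applied to each column.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]


# Illustrative affinities only; these numbers are not from the paper.
scores = [[4.0, 1.0, 0.0],
          [0.5, 3.0, 1.0],
          [0.0, 1.5, 5.0]]
assignment = log_sinkhorn(scores)
```

Working in log space avoids the overflow that exponentiating large scores would cause. A practical tracker also has to handle newly appearing and disappearing objects, which this square-matrix sketch leaves out.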

Abstract

3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Generally, the two tasks are handled separately in traditional paradigms, and a few methods have recently started to explore modeling these two tasks in a joint manner. However, these approaches suffer from the limitations of single-frame training and inconsistent coordinate representations between tracking and prediction tasks. In this paper, we propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP) to address the above challenges. Firstly, we construct the model in a streaming manner and exploit a memory bank to preserve and leverage the long-term latent features for tracked objects more effectively. Secondly, a relative spatio-temporal positional encoding strategy is introduced to bridge the gap of coordinate representations between the two tasks and maintain the pose-invariance for trajectory prediction. Thirdly, we further improve the quality and consistency of predicted trajectories with a dual-stream predictor. We conduct extensive experiments on the popular nuScenes dataset, and the experimental results demonstrate the effectiveness and superiority of StreamMOTP, which outperforms previous methods significantly on both tasks. Furthermore, we also prove that the proposed framework has great potential and advantages in actual applications of autonomous driving.
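To make the pose-invariance claim concrete, here is a small sketch of the geometric quantity a relative spatio-temporal encoding could be built on. It is an illustration under assumptions, not the paper's STPE: poses are taken to be hypothetical `(x, y, heading, t)` tuples in a shared 2D world frame, and the helpers `relative_pose` and `transform` are invented for this example. Expressing one object in another object's local frame yields values that do not change when the whole scene is rotated or translated.

```python
import math


def relative_pose(pose_i, pose_j):
    """Express pose_j in the local frame of pose_i.

    Each pose is an assumed (x, y, heading, t) tuple.
    Returns (dx, dy, dheading, dt).
    """
    xi, yi, hi, ti = pose_i
    xj, yj, hj, tj = pose_j
    dx_w, dy_w = xj - xi, yj - yi
    c, s = math.cos(hi), math.sin(hi)
    # Rotate the world-frame offset by -heading_i into i's local frame.
    dx = c * dx_w + s * dy_w
    dy = -s * dx_w + c * dy_w
    # Wrap the heading difference into (-pi, pi].
    dh = math.atan2(math.sin(hj - hi), math.cos(hj - hi))
    return (dx, dy, dh, tj - ti)


def transform(pose, tx, ty, rot):
    """Apply a global rigid transform (rotation, then translation)."""
    x, y, h, t = pose
    c, s = math.cos(rot), math.sin(rot)
    return (c * x - s * y + tx, s * x + c * y + ty, h + rot, t)


# Illustrative poses only.
a = (1.0, 2.0, 0.3, 0.0)
b = (4.0, 6.0, 1.0, 0.5)
rel_before = relative_pose(a, b)
rel_after = relative_pose(transform(a, 10.0, -5.0, 0.7),
                          transform(b, 10.0, -5.0, 0.7))
```

Here `rel_before` and `rel_after` agree up to floating-point error, which is the invariance property in question. How such relative quantities are embedded and injected into attention is specific to the paper and is not shown here.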

Paper Structure (16 sections, 13 equations, 5 figures, 5 tables)

This paper contains 16 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Different pipelines for the tasks of multi-object tracking and trajectory prediction in autonomous driving. (a) Cascade paradigm, where the two tasks are performed separately with non-differentiable transitions. (b) Joint single-frame paradigm, where the two tasks are performed jointly in a parallelized framework per frame. (c) The proposed StreamMOTP, where the memory, feature, and gradient are propagated across consecutive frames to enhance the long-term modeling ability and temporal consistency.
  • Figure 2: Overview of StreamMOTP. Tracklets and proposals denote the previous frame trajectories and the current frame detections respectively. The model first performs Attentional Spatio-Temporal Interaction, which is based on attention with STPE, to get context features. The tasks of tracking and prediction are then performed based on those context features. Memories with up-to-date context features and tracking results are updated at each time step.
  • Figure 3: Overview of the dual-stream predictor. Two branches predict futures for the previous-frame trajectories and the current-frame detections simultaneously. The streaming connection between consecutive frames smooths the predicted trajectories.
  • Figure 4: The idea of temporal consistency between consecutive frames, where the consistency of the overlap is beneficial for aligning trajectories for continuity and stability.
  • Figure 5: Qualitative results of StreamMOTP on the nuScenes validation set over consecutive frames. The tracked history and detections are shown in black; the model's highest-scoring prediction and the ground-truth trajectory are drawn in blue and red respectively. The predictions of other modes are drawn in gray. The top row shows the results of the dual-stream predictor, while the bottom row shows the results of a base predictor.
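
The temporal-consistency idea in Figure 4 can be sketched as a simple overlap measure. This is an assumption-laden illustration rather than the loss used in the paper: it supposes that predictions from two consecutive frames share one coordinate frame, have the same horizon, and are offset by exactly one time step, and the function name `overlap_consistency` is hypothetical.

```python
import math


def overlap_consistency(pred_prev, pred_curr):
    """Mean L2 distance over the overlapping part of two predictions.

    pred_prev: waypoints predicted at frame t-1 for steps t, ..., t+T-1.
    pred_curr: waypoints predicted at frame t for steps t+1, ..., t+T.
    Both are lists of (x, y) tuples in the same (assumed) frame.
    """
    overlap_prev = pred_prev[1:]   # steps t+1, ..., t+T-1
    overlap_curr = pred_curr[:-1]  # steps t+1, ..., t+T-1
    dists = [math.dist(p, q) for p, q in zip(overlap_prev, overlap_curr)]
    return sum(dists) / len(dists)


# Illustrative waypoints only.
prev_frame = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
curr_frame = [(2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
gap = overlap_consistency(prev_frame, curr_frame)
```

A value near zero means the two predictions agree on the time steps they share. A streaming predictor could penalise this gap to discourage trajectories that jump between frames.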