Post-interactive Multimodal Trajectory Prediction for Autonomous Driving

Ziyi Huang, Yang Li, Dushuai Li, Yao Mu, Hongmao Qin, Nan Zheng

TL;DR

This work tackles the uncertainty in autonomous driving trajectory prediction by emphasizing post-interaction features, which have been underexplored. It introduces Pioformer, a coarse-to-fine Transformer framework consisting of a Coarse Trajectory Network (CTN), a Trajectory Proposal Network (TPN) based on a Hyper-Interactor (HGNN), and a Proposal Refinement Network (PRN) that iteratively refines trajectory proposals using post-interaction cues. A three-stage training scheme progressively trains CTN, TPN, and PRN to stabilize learning and leverage high-order interactions, achieving strong accuracy with a compact model on Argoverse 1 and generalizing to Argoverse 2. The approach also demonstrates practical gains for motion planning, yielding safer and more reliable ego-vehicle plans in strongly interactive scenarios. Overall, Pioformer advances multimodal trajectory prediction by explicitly modeling high-order post-interactions and integrating refinement stages with planning considerations, all while maintaining a favorable model size-to-accuracy balance.

Abstract

Modeling the interactions among agents for trajectory prediction in autonomous driving is challenging because of the inherent uncertainty in agents' behavior. The interactions among the predicted trajectories of agents, also called post-interactions, have rarely been considered in trajectory prediction models. To this end, we propose Pioformer, a coarse-to-fine Transformer for multimodal trajectory prediction that explicitly extracts post-interaction features to improve prediction accuracy. Specifically, we first build a Coarse Trajectory Network that generates coarse trajectories from the observed trajectories and lane segments, with low-order interaction features extracted by graph neural networks. Next, we build a hypergraph neural network-based Trajectory Proposal Network that generates trajectory proposals, with high-order interaction features learned by the hypergraphs. Finally, the trajectory proposals are sent to the Proposal Refinement Network for further refinement. The observed trajectories and trajectory proposals are concatenated as the input of the Proposal Refinement Network, which learns post-interaction features by combining the previous interaction features with trajectory consistency features. Moreover, we propose a three-stage training scheme to facilitate the learning process. Extensive experiments on the Argoverse 1 dataset demonstrate the superiority of our method. Compared with the baseline HiVT-64, our model reduces the prediction errors by 4.4%, 8.4%, 14.4%, and 5.7% on the metrics minADE6, minFDE6, MR6, and brier-minFDE6, respectively.
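The four metrics named in the abstract can be illustrated with a minimal sketch for a single agent. This is not the authors' evaluation code. It assumes the commonly used definitions for K predicted modes: minADE is the smallest average displacement over modes, minFDE is the smallest endpoint displacement, a miss is a minFDE above a 2.0 m threshold, and brier-minFDE adds `(1 - p)^2` for the confidence `p` of the best-endpoint mode. The function name `single_agent_metrics`, the threshold default, and the choice to take the minADE minimum independently of the minFDE mode are assumptions made for illustration.

```python
import math


def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def single_agent_metrics(modes, probs, gt, miss_threshold=2.0):
    """Compute illustrative multimodal metrics for one agent.

    modes: list of K predicted trajectories, each a list of (x, y) points
    probs: list of K confidences, one per mode
    gt:    ground-truth trajectory, same length as each mode
    miss_threshold: endpoint error (metres) above which the agent counts
                    as a miss (assumed value, not taken from the paper)
    """
    # Average displacement error of every mode.
    ades = [
        sum(displacement(p, g) for p, g in zip(traj, gt)) / len(gt)
        for traj in modes
    ]
    # Final (endpoint) displacement error of every mode.
    fdes = [displacement(traj[-1], gt[-1]) for traj in modes]
    # Mode whose endpoint is closest to the ground truth.
    best = min(range(len(modes)), key=lambda k: fdes[k])
    min_fde = fdes[best]
    return {
        "minADE": min(ades),
        "minFDE": min_fde,
        "miss": min_fde > miss_threshold,
        "brier-minFDE": min_fde + (1.0 - probs[best]) ** 2,
    }


if __name__ == "__main__":
    gt = [(0, 0), (1, 0), (2, 0)]
    modes = [
        [(0, 0), (1, 0), (2, 0)],  # matches the ground truth exactly
        [(0, 3), (1, 3), (2, 3)],  # offset by 3 m at every step
    ]
    print(single_agent_metrics(modes, [0.5, 0.5], gt))
```

MR6 would then be the fraction of evaluated agents for which `miss` is true when six modes are predicted. The Brier term penalizes a correct mode that was assigned low confidence, which is why the abstract reports it separately from minFDE6.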

Paper Structure

This paper contains 44 sections, 34 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Illustration of trajectory predictions that consider post-interaction behavior. This study proposes a method, i.e., Pioformer, for trajectory predictions of multiple agents. (a) Current time step $t=T_\text{obs}$. We observe the post-interaction in the coarse predicted trajectory, which refers to the overlap between the predicted trajectories of two agents at future time step $t=T_\text{obs}+N_p\Delta t$. (b) We use Pioformer to make predictions, which utilizes the post-interaction features embedded in coarse trajectories to refine the trajectory predictions. There are no collisions in the refined predicted trajectory. (c) Ground truth trajectories show that there are no collisions between the two agents at time step $t=T_\text{obs}+N_p\Delta t$. This indicates that our model can generate refined predictions that align well with the ground truth trajectory.
  • Figure 2: The architecture of Pioformer. The entire model can be divided into three networks: Coarse Trajectory Network (CTN), Trajectory Proposal Network (TPN) and Proposal Refinement Network (PRN). The top one is the CTN, which includes Scene Encoder, Global Interactor (GI), and Auxiliary Decoder, providing coarse predictions and low-order features (contextual information and simple pairwise interaction features). The bottom-left one is the TPN, which includes Hyper-Interactor (HI) and Decoder, generating trajectory proposals and corresponding confidences. HI leverages the low-order features to extract high-order post-interaction features (interactions beyond pairwise relationships) and refine the trajectories at the latent space. The bottom-right one is the PRN, which takes the trajectory proposals combined with observed trajectories as input. It further extracts post-interaction features of different trajectories and explores spatial-temporal consistency features for individual trajectories to generate offsets and refine initial proposals at the trajectory level. Both TPN and PRN are proposed post-interactive networks.
  • Figure 3: The overview of the three-stage training scheme. The three training stages are executed sequentially from left to right with each subsequent training stage reloading weights from the previous stage. In the first stage, we train CTN to achieve coarse trajectory prediction. In the second stage, we train CTN and TPN together, with the decoder in CTN acting as an auxiliary decoder while the decoder in TPN generates trajectory proposals. In the third stage, we train CTN, TPN and PRN together, with PRN refining the trajectory proposals to produce the final prediction. Each of the three networks has its corresponding trajectory regression loss and confidence classification loss.
  • Figure 4: Overview of the error-versus-size trade-off for trajectory prediction on the Argoverse 1 leaderboard. Despite their compact size, our models outperform most state-of-the-art models in prediction accuracy. In particular, Pioformer is approximately one-third the size of other models (e.g., HiVT-128 and Macformer-L) that achieve similar accuracy.
  • Figure 5: An example of multimodal trajectory prediction in a traffic scenario. We predict multiple possible trajectories based on the observed trajectories of agents and lane information.
  • ...and 6 more figures
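The Figure 1 caption describes a post-interaction as an overlap between the predicted trajectories of two agents at a future time step. The sketch below makes that notion concrete with a simple geometric check. It is only an illustration: Pioformer learns post-interaction features with its networks rather than applying a hand-written rule. The function name `first_overlap_step` and the `radius` parameter are hypothetical.

```python
import math


def first_overlap_step(traj_a, traj_b, radius=2.0):
    """Return the first future step at which two predicted trajectories
    come closer than `radius` metres, or None if they never do.

    traj_a, traj_b: lists of (x, y) points sampled at the same time steps
    radius: distance below which the agents are treated as overlapping
            (an assumed value chosen for this illustration)
    """
    for step, (pa, pb) in enumerate(zip(traj_a, traj_b)):
        if math.hypot(pa[0] - pb[0], pa[1] - pb[1]) < radius:
            return step
    return None


if __name__ == "__main__":
    agent_a = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
    # Coarse prediction: agent B reaches almost the same point as agent A.
    coarse_b = [(10.0, 10.0), (10.0, 5.0), (10.0, 0.5)]
    # Refined prediction: agent B slows down and stays clear of agent A.
    refined_b = [(10.0, 10.0), (10.0, 7.0), (10.0, 4.0)]
    print(first_overlap_step(agent_a, coarse_b))   # 2
    print(first_overlap_step(agent_a, refined_b))  # None
```

In the toy example, the coarse pair overlaps at the last step, as in panel (a) of Figure 1, while the refined pair does not, as in panel (b).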