DEMO: A Dynamics-Enhanced Learning Model for Multi-Horizon Trajectory Prediction in Autonomous Vehicles

Chengyue Wang, Haicheng Liao, Kaiqun Zhu, Guohui Zhang, Zhenning Li

TL;DR

DEMO tackles autonomous-vehicle trajectory prediction across short-term and long-term horizons by integrating physics-based dynamics with learning-based interaction modeling. It introduces a Dynamics Learning Stage that fuses a Dynamic Bicycle Model with a Dynamic Conditional Variational Autoencoder (DynCVAE) to capture immediate motion, and an Interaction Learning Stage that uses a Temporal Encoder, cross-modal fusion with HD-map data, and a Spatial-temporal Encoder to model social and environmental interactions. A Multi-modal Decoder then generates multiple trajectory hypotheses and maneuver probabilities, supervised by losses including $\mathcal{L}_{KL}$, $\mathcal{L}_{DI}$, and dataset-specific accuracy terms. Across NGSIM, HighD, MoCAD, and nuScenes, DEMO achieves state-of-the-art accuracy for both horizons and exhibits fast inference, indicating strong practical potential for real-time AV systems.
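The summary names $\mathcal{L}_{KL}$ as one of the training losses but does not give its form. As a minimal sketch, assuming the DynCVAE uses diagonal Gaussian posterior and prior distributions (a common CVAE choice, not confirmed by this summary), the KL term has a closed form. The function name `gaussian_kl` and its arguments are hypothetical.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p=None, logvar_p=None):
    """KL(q || p) between two diagonal Gaussians, summed over latent dims.

    Assumption: both distributions are diagonal Gaussians parameterized by
    a mean and a log-variance per dimension. If no prior is given, a
    standard normal N(0, I) is used.
    """
    n = len(mu_q)
    mu_p = mu_p if mu_p is not None else [0.0] * n
    logvar_p = logvar_p if logvar_p is not None else [0.0] * n

    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension closed form:
        # 0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1)
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl
```

The term is zero when the posterior matches the prior and grows as the two diverge, which is what lets it act as a regularizer on the latent space.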

Abstract

Autonomous vehicles (AVs) rely on accurate trajectory prediction of surrounding vehicles to ensure the safety of both passengers and other road users. Trajectory prediction spans both short-term and long-term horizons, each requiring distinct considerations: short-term predictions rely on accurately capturing the vehicle's dynamics, while long-term predictions rely on accurately modeling the interaction patterns within the environment. However, current approaches, whether physics-based or learning-based, ignore these distinct considerations and therefore struggle to produce optimal predictions for both short-term and long-term horizons. In this paper, we introduce the Dynamics-Enhanced Learning MOdel (DEMO), a novel approach that combines a physics-based Vehicle Dynamics Model with advanced deep learning algorithms. DEMO employs a two-stage architecture, featuring a Dynamics Learning Stage and an Interaction Learning Stage, where the former focuses on capturing vehicle motion dynamics and the latter focuses on modeling interactions. By capitalizing on the respective strengths of both methods, DEMO facilitates multi-horizon predictions of future trajectories. Experimental results on the Next Generation Simulation (NGSIM), Macau Connected Autonomous Driving (MoCAD), Highway Drone (HighD), and nuScenes datasets demonstrate that DEMO outperforms state-of-the-art (SOTA) baselines in both short-term and long-term prediction horizons.
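The abstract distinguishes short-term from long-term horizons. One way to make that distinction concrete when comparing models is to report error separately at each horizon. The sketch below is illustrative only: the RMSE metric, the function name `rmse_at_horizons`, and the sampling interval `dt` are assumptions, not details stated in this abstract.

```python
import math


def rmse_at_horizons(pred, truth, horizons_s, dt):
    """Root-mean-square displacement error at selected prediction horizons.

    pred, truth: lists of trajectories; each trajectory is a list of (x, y)
                 points sampled every `dt` seconds, first point at t = dt.
    horizons_s:  horizons in seconds at which to evaluate (hypothetical values).
    Returns a dict mapping each horizon to its RMSE in the input units.
    """
    result = {}
    for h in horizons_s:
        idx = round(h / dt) - 1  # index of the sample taken at time h
        squared = [
            (p[idx][0] - g[idx][0]) ** 2 + (p[idx][1] - g[idx][1]) ** 2
            for p, g in zip(pred, truth)
        ]
        result[h] = math.sqrt(sum(squared) / len(squared))
    return result
```

Reporting the error per horizon, rather than as a single average, makes it visible when a model is accurate in the first second but drifts later, or the reverse.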

Paper Structure

This paper contains 19 sections, 11 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison between the traditional Physics-based model (a), Learning-based model (b), and our proposed DEMO (c). Unlike previous approaches, DEMO employs a two-stage architecture for hierarchical prediction, integrating a Dynamics Learning Stage with an Interaction Learning Stage.
  • Figure 2: Overall architecture of DEMO. In the Dynamics Learning Stage, DEMO takes the historical trajectory as input, processes it through the DynCVAE and the Dynamic Bicycle Model, and generates the dynamic features and short-term trajectory. In the subsequent Interaction Learning Stage, the model leverages the enhanced output from the previous stage, along with HD map data, to model the interactions present in the scene. Finally, a Multi-modal Decoder integrates the outputs from both stages to generate multi-modal predictions.
  • Figure 3: Illustration of the Cross-modal Fusion. Panel (a) presents the pipeline of the Cross-modal Fusion process, while Panel (b) provides a detailed illustration of the mechanism of the Cross-modal Attention Head.
  • Figure 4: Qualitative comparison of various models on the nuScenes dataset. We select representative scenarios including a T-intersection (a), roundabout (b), interchange (c), and intersection (d). Panels (a), (b), and (c) visualize the most probable predictions of each model alongside the ground truth. Panel (d) visualizes the multi-modal prediction results in comparison with the ground truth.
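Figure 2 describes a physics model that turns vehicle state into a short-term trajectory. To illustrate the general idea, the sketch below rolls out a kinematic bicycle model, which is a simpler stand-in and not the Dynamic Bicycle Model used in DEMO (a dynamic model would additionally account for tire forces). The `State` class, the function names, the wheelbase of 2.7 m, and the time step of 0.1 s are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float        # position along x, metres
    y: float        # position along y, metres
    heading: float  # yaw angle, radians
    speed: float    # longitudinal speed, metres per second


def step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """Advance a kinematic bicycle model by one forward-Euler step."""
    x = state.x + state.speed * math.cos(state.heading) * dt
    y = state.y + state.speed * math.sin(state.heading) * dt
    heading = state.heading + state.speed / wheelbase * math.tan(steer) * dt
    speed = max(0.0, state.speed + accel * dt)  # no reversing in this sketch
    return State(x, y, heading, speed)


def rollout(state, controls, wheelbase=2.7, dt=0.1):
    """Apply a sequence of (accel, steer) controls; return the (x, y) path."""
    path = []
    for accel, steer in controls:
        state = step(state, accel, steer, wheelbase, dt)
        path.append((state.x, state.y))
    return path
```

For example, `rollout(State(0.0, 0.0, 0.0, 10.0), [(0.0, 0.05)] * 10)` produces a one-second path that curves gently to the left. In a learned pipeline, the controls would come from a network rather than being fixed by hand.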