SGNetPose+: Stepwise Goal-Driven Networks with Pose Information for Trajectory Prediction in Autonomous Driving

Akshat Ghiya, Ali K. AlShami, Jugal Kalita

TL;DR

SGNetPose+ addresses pedestrian trajectory prediction for autonomous driving by fusing bounding-box information with pose cues (skeleton joints and body angles) through a framework that combines a dual encoder, a CVAE, and a Stepwise Goal Estimator. It uses ViTPose to extract 13-keypoint skeletons and applies horizontal frame flipping as augmentation to create the pose-enriched JAAD_pose and PIE_pose datasets, on which it outperforms the SGNet baseline. Skeleton data yield the strongest gains on JAAD_pose, with notable reductions in MSE and final-frame errors, while the benefits on PIE_pose are more modest and depend on data size. The study demonstrates the value of pose information for more accurate trajectory prediction and points to 3D skeletons and orientation-based features as promising directions for further improvement.

Abstract

Predicting pedestrian trajectories is essential for autonomous driving systems because it improves safety and supports informed decision-making. Accurate predictions help prevent collisions, anticipate crossing intent, and improve overall system efficiency. In this study, we present SGNetPose+, an enhancement of the SGNet architecture that integrates skeleton information or body segment angles with bounding boxes to predict pedestrian trajectories from video data, with the aim of avoiding hazards in autonomous driving. Skeleton information was extracted using a pose estimation model, and joint angles were computed from the extracted joint data. We also apply temporal data augmentation by horizontally flipping video frames to increase the dataset size and improve performance. Our approach achieves state-of-the-art results on the JAAD and PIE datasets using pose data with the bounding boxes, outperforming the SGNet model. Code is available on GitHub: SGNetPose+.
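
The two preprocessing ideas mentioned in the abstract, computing angles from extracted joints and horizontally flipping frames, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a body segment angle is the orientation of the line between two 2D joints relative to the image x-axis, that bounding boxes are stored as `(x1, y1, x2, y2)` in pixel coordinates, and that flipping mirrors x-coordinates about the frame width. The function names `segment_angle`, `flip_box`, and `flip_keypoints` are hypothetical.

```python
import math


def segment_angle(joint_a, joint_b):
    """Orientation (radians) of the segment joint_a -> joint_b.

    Assumption: the angle is measured from the image x-axis with atan2;
    the paper may define its body angles differently.
    """
    dx = joint_b[0] - joint_a[0]
    dy = joint_b[1] - joint_a[1]
    return math.atan2(dy, dx)


def flip_box(box, frame_width):
    """Mirror an (x1, y1, x2, y2) box about the vertical axis of the frame.

    The left and right edges swap roles so that x1 <= x2 still holds.
    """
    x1, y1, x2, y2 = box
    return (frame_width - x2, y1, frame_width - x1, y2)


def flip_keypoints(keypoints, frame_width):
    """Mirror a list of (x, y) keypoints about the vertical axis of the frame.

    Assumption: coordinates are continuous pixel positions. Swapping
    left/right joint labels after the flip is not handled here.
    """
    return [(frame_width - x, y) for x, y in keypoints]


if __name__ == "__main__":
    shoulder, elbow = (100.0, 50.0), (120.0, 70.0)
    print(math.degrees(segment_angle(shoulder, elbow)))
    print(flip_box((10, 20, 30, 40), frame_width=100))
    print(flip_keypoints([shoulder, elbow], frame_width=1920))
```

Mirroring the annotations together with the frames keeps each flipped sample consistent, so every original sequence contributes a second training sequence.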

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: A frame from the JAAD dataset, showing a woman in a white coat crossing the street with her bounding box superimposed. A car is approaching in the center lane to her right.
  • Figure 2: Visualization of SGNetPose+. Encoder time evolves vertically, from time $t$ to $t+1$, while decoder time flows horizontally, predicting the trajectory from $t+1$ to $t+l_d$. At the start, trajectory information (bounding boxes, denoted as $x^t$) is fed to an RNN cell, while the pose information (skeleton or body angle, denoted as $p^t$) is fed to a separate RNN cell. The outputs of the cells are fed to the CVAE and to the SGE (Stepwise Goal Estimator). The CVAE output, the SGE's estimated goals, and the RNN cell outputs are combined and fed to the decoder, which produces the predicted location $y^t$.
  • Figure 3: Comparison of test loss for JAAD and PIE datasets.
  • Figure 4: Comparison of $JAAD_{pose}$ and $PIE_{pose}$ with different metrics.