SGNetPose+: Stepwise Goal-Driven Networks with Pose Information for Trajectory Prediction in Autonomous Driving
Akshat Ghiya, Ali K. AlShami, Jugal Kalita
TL;DR
SGNetPose+ addresses pedestrian trajectory prediction for autonomous driving by fusing bounding-box information with pose cues (skeleton joints and body angles) in a framework that combines a dual encoder, a CVAE, and a Stepwise Goal Estimator. It uses ViTPose to extract 13-keypoint skeletons and horizontally flips video frames as augmentation to build the pose-enriched JAAD_pose and PIE_pose datasets, on which it outperforms the SGNet baseline and achieves state-of-the-art results. Skeleton data yield the strongest gains on JAAD_pose, with notable reductions in MSE and final-frame errors, while the gains on PIE_pose are more modest and depend on data size. The study demonstrates the value of pose information for more accurate trajectory prediction and points to 3D skeletons and orientation-based features as promising directions for further improvement.
Abstract
Predicting pedestrian trajectories is essential for autonomous driving systems, as it significantly enhances safety and supports informed decision-making. Accurate predictions help prevent collisions, anticipate crossing intent, and improve overall system efficiency. In this study, we present SGNetPose+, an enhancement of the SGNet architecture that integrates skeleton information or body segment angles with bounding boxes to predict pedestrian trajectories from video data, helping autonomous vehicles avoid hazards. Skeleton information is extracted using a pose estimation model, and joint angles are computed from the extracted joints. We also apply temporal data augmentation by horizontally flipping video frames to increase the dataset size and improve performance. Using pose data together with bounding boxes, our approach achieves state-of-the-art results on the JAAD and PIE datasets, outperforming the SGNet model. Code is available on GitHub: SGNetPose+.
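To make the two pose-related steps above concrete, the following is a minimal sketch, not the released implementation, of how a horizontal flip can be applied to a bounding box and its keypoints and how a body segment angle can be derived from two joints. Everything specific here is an assumption for illustration: the function names, the `[x1, y1, x2, y2]` pixel box format, the `(K, 2)` keypoint layout, the `swap_pairs` argument for exchanging left and right joint indices, and the definition of a segment angle as the `atan2` direction of the vector between two joints in image coordinates.

```python
import numpy as np


def hflip_box(box, width):
    """Mirror an [x1, y1, x2, y2] pixel box about the vertical image axis.

    Assumed box format; x1 and x2 trade roles so that x1 <= x2 still holds.
    """
    x1, y1, x2, y2 = box
    return np.array([width - x2, y1, width - x1, y2], dtype=float)


def hflip_keypoints(kpts, width, swap_pairs=()):
    """Mirror (K, 2) keypoints and exchange left/right joint indices.

    swap_pairs is a hypothetical list of (left_idx, right_idx) tuples; the
    actual index layout depends on the pose estimator's keypoint ordering.
    """
    out = np.asarray(kpts, dtype=float).copy()
    out[:, 0] = width - out[:, 0]
    for left, right in swap_pairs:
        # Fancy indexing on the right-hand side copies, so this is a true swap.
        out[[left, right]] = out[[right, left]]
    return out


def segment_angle(kpts, a, b):
    """Direction (degrees) of the segment from joint a to joint b.

    Assumed definition: atan2(dy, dx) in image coordinates (y points down).
    """
    pts = np.asarray(kpts, dtype=float)
    dx, dy = pts[b] - pts[a]
    return float(np.degrees(np.arctan2(dy, dx)))
```

Flipping negates the horizontal component of every segment, so an angle computed this way maps from θ to 180° − θ after augmentation; recomputing angles from the flipped keypoints, rather than reusing the originals, keeps the two feature types consistent.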
