Legs Over Arms: On the Predictive Value of Lower-Body Pose for Human Trajectory Prediction from Egocentric Robot Perception
Nhat Le, Daeun Song, Xuesu Xiao
TL;DR
This work investigates which human skeletal cues best predict multi-agent trajectories from egocentric robot perception, focusing on lower-body 3D keypoints and derived biomechanical cues. Using the Human Scene Transformer backbone, the authors systematically compare 3D vs 2D pose inputs and full vs lower-body regions on JRDB and a new 360° panoramic dataset, evaluating with $MinADE$, $MinFDE$, $MLADE$, and $NLL_{pos}$. They find that lower-body 3D keypoints yield the largest improvements (e.g., ~13% ADE reduction) and that 2D keypoints from equirectangular images also provide meaningful gains (~7% ADE) despite distortion, with biomechanical cues offering modest extra gains. The results offer practical guidance on feature selection and sensor configuration for efficient, socially aware robot navigation in crowded environments, including recommendations on camera placement and the feasibility of panoramic cues for motion forecasting.
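As a rough illustration of the displacement metrics named above, the sketch below computes minADE and minFDE for a single agent under their commonly used definitions: the best average and best final-step Euclidean error over K predicted futures. The function names and toy trajectories are hypothetical, and the authors' evaluation code (for example, how errors are averaged across agents and scenes) may differ.

```python
import math

def displacement_errors(pred, truth):
    """Per-timestep Euclidean distance between one predicted path and the ground truth."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]

def min_ade(modes, truth):
    """Smallest average displacement error over the K candidate futures."""
    return min(sum(displacement_errors(m, truth)) / len(truth) for m in modes)

def min_fde(modes, truth):
    """Smallest final-timestep displacement error over the K candidate futures."""
    return min(math.dist(m[-1], truth[-1]) for m in modes)

# Toy data (illustrative only): one agent, three future timesteps, two predicted modes.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the true path
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # deviates only at the last step
]

print(min_ade(modes, truth), min_fde(modes, truth))
```

Both metrics select the best of the K modes, so they reward a model for covering the true future with at least one hypothesis rather than for ranking its hypotheses well.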
Abstract
Predicting human trajectories is crucial for social robot navigation in crowded environments. While most existing approaches treat humans as point masses, we present a study on multi-agent trajectory prediction that leverages different human skeletal features to improve forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and a new dataset for social navigation with 360-degree panoramic videos, we find that focusing on lower-body 3D keypoints yields a 13% reduction in Average Displacement Error, and that augmenting 3D keypoint inputs with the corresponding biomechanical cues provides a further 1-4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching people's legs provides actionable insights for designing sensing capabilities for social robot navigation.
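To make the idea of restricting the input to lower-body keypoints concrete, here is a minimal sketch. It assumes a COCO-style 17-joint skeleton ordering and treats hips, knees, and ankles as the lower body. Both are assumptions for illustration: the abstract does not state the skeleton format or the exact joint subset used in the study, and the helper names are hypothetical.

```python
# ASSUMPTION: COCO-style 17-keypoint ordering; the study's skeleton format may differ.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# ASSUMPTION: "lower body" is taken here to mean hips, knees, and ankles.
LOWER_BODY = (
    "left_hip", "right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle",
)
LOWER_BODY_IDX = [COCO_KEYPOINTS.index(name) for name in LOWER_BODY]

def lower_body(skeleton):
    """Keep only the lower-body joints from one frame of 3D keypoints."""
    if len(skeleton) != len(COCO_KEYPOINTS):
        raise ValueError("expected one (x, y, z) tuple per keypoint")
    return [skeleton[i] for i in LOWER_BODY_IDX]

# Dummy frame (illustrative only): joint i sits at (i, 0, 0).
skeleton = [(float(i), 0.0, 0.0) for i in range(len(COCO_KEYPOINTS))]
legs = lower_body(skeleton)
print(len(legs), legs[0])
```

Under these assumptions the per-person pose input shrinks from 17 joints to 6, which is the kind of reduced feature set the abstract compares against full-body input.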
