Legs Over Arms: On the Predictive Value of Lower-Body Pose for Human Trajectory Prediction from Egocentric Robot Perception

Nhat Le; Daeun Song; Xuesu Xiao

TL;DR

This work investigates which human skeletal cues best predict multi-agent trajectories from egocentric robot perception, focusing on lower-body 3D keypoints and derived biomechanical cues. Using the Human Scene Transformer backbone, the authors systematically compare 3D vs 2D pose inputs and full vs lower-body regions on JRDB and a new 360° panoramic dataset, evaluating with $MinADE$, $MinFDE$, $MLADE$, and $NLL_{pos}$. They find that lower-body 3D keypoints yield the largest improvements (e.g., ~13% ADE reduction) and that 2D keypoints from equirectangular images also provide meaningful gains (~7% ADE) despite distortion, with biomechanical cues offering modest extra gains. The results offer practical guidance on feature selection and sensor configuration for efficient, socially aware robot navigation in crowded environments, including recommendations on camera placement and the feasibility of panoramic cues for motion forecasting.

Abstract

Predicting human trajectories is crucial for social robot navigation in crowded environments. While most existing approaches treat humans as point masses, we present a study on multi-agent trajectory prediction that leverages different human skeletal features for improved forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and a new dataset for social navigation with 360-degree panoramic videos, we find that focusing on lower-body 3D keypoints yields a 13% reduction in Average Displacement Error, and that augmenting 3D keypoint inputs with the corresponding biomechanical cues provides a further 1-4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching people's legs provides actionable insights for designing sensing capabilities for social robot navigation.
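
The displacement metrics named above can be illustrated with a minimal sketch. It assumes the common definitions of minimum Average and Final Displacement Error over K candidate trajectories, with each minimum taken independently over the candidates; the exact definitions used in the paper are not given in this excerpt, and the function names `min_ade_fde` and `relative_reduction` are hypothetical.

```python
import numpy as np


def min_ade_fde(pred, gt):
    """Return (minADE, minFDE) for one agent.

    pred: array of shape (K, T, 2), K candidate future trajectories.
    gt:   array of shape (T, 2), the ground-truth future trajectory.
    Assumption: each minimum is taken independently over the K candidates.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-candidate, per-timestep Euclidean distance to ground truth: (K, T)
    dists = np.linalg.norm(pred - gt[None, :, :], axis=-1)
    ade_per_mode = dists.mean(axis=1)  # average over the horizon
    fde_per_mode = dists[:, -1]        # error at the final timestep
    return float(ade_per_mode.min()), float(fde_per_mode.min())


def relative_reduction(baseline, improved):
    """Fractional error reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Toy example with two candidate trajectories over three timesteps.
gt = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
pred = [
    [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]],  # constant 1 m lateral offset
    [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]],  # exact until the last step
]
min_ade, min_fde = min_ade_fde(pred, gt)
```

In this toy case the two minima come from different candidates, which is why the two metrics are usually reported side by side. Under these definitions, a reported "13% reduction in ADE" corresponds to `relative_reduction(baseline_ade, new_ade)` being about 0.13.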

Paper Structure (20 sections, 4 equations, 3 figures, 3 tables)

Introduction
Related work
Human Trajectory Prediction for Social Navigation
Human Pose Detection and Intent Prediction
Egocentric Perception Datasets
Methodology
Problem formulation
Datasets
Implementation Details
Evaluation
Results and Discussion
Predictive Value of Lower-body Keypoints
3D versus 2D Keypoints on JRDB
2D Keypoints from Equirectangular Images
Discussion
...and 5 more sections

Figures (3)

Figure 0: Diagram of the 33-keypoint 3D skeletal pose, with $K^{3D}_L$ and $K^{3D}_U$ enclosed in green and orange boxes, respectively. Adapted from MediaPipe Pose [mediapipe-pose].
Figure 1: Our AgileX Scout Mini robot setup with onboard sensors: Insta360 X4 360° camera, Zed 2 RGB camera, Velodyne VLP-16 LiDAR, and WitMotion IMU.
Figure 2: Sample panoramic images from JRDB (left) and our dataset (right). In JRDB, nearby humans are cropped due to the camera's limited vertical field of view, while in our dataset, full 2D keypoints can be detected even at distances below 1m.
