Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction
Yizhou Huang, Yihua Cheng, Kezhi Wang
TL;DR
This work tackles pedestrian trajectory prediction under long-tailed behavior distributions by moving beyond fully supervised learning to a self-supervised framework that explicitly models position, velocity, and acceleration. It introduces a three-stream encoder–decoder network with hierarchical feature fusion, where velocity informs position and acceleration informs velocity, and employs a motion-consistency mechanism to derive pseudo-labels from predicted motion. The model optimizes multiple losses, including $L_{pos}$, $L_{va}$, $L_{cons1}$, and $L_{cons2}$, to align predictions across the three motion states, and it selects the most coherent velocity/acceleration pair to supervise pseudo-values via cross-entropy. On the ETH-UCY and SDD benchmarks, the approach achieves state-of-the-art results, is supported by ablation studies, and remains robust across different prediction horizons and hypothesis counts, with practical implications for safer, more reliable autonomous systems.
Abstract
Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where predicted trajectories are optimized directly against ground-truth labels. This reliance amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supervised pedestrian trajectory prediction framework that explicitly models position, velocity, and acceleration. We leverage velocity and acceleration information to enhance position prediction through feature injection and a self-supervised motion consistency mechanism. Our model hierarchically injects velocity features into the position stream and acceleration features into the velocity stream, which enables it to predict position, velocity, and acceleration jointly. From the predicted position, we compute the corresponding pseudo velocity and acceleration, allowing the model to learn from data-generated pseudo labels and thus train in a self-supervised manner. We further design a motion consistency evaluation strategy grounded in physical principles; it selects the most reasonable predicted motion trend by comparing it with historical dynamics and uses this trend to guide and constrain trajectory generation. Experiments on the ETH-UCY and Stanford Drone datasets show that our method achieves state-of-the-art performance on both.
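To make the pseudo-label idea concrete, the following is a minimal sketch of how velocity and acceleration could be derived from a predicted position sequence by first-order finite differences. It is an illustration under stated assumptions, not the authors' implementation: the function names (`finite_difference`, `pseudo_labels`, `mean_squared_gap`), the default time step of 0.4 s, and the use of a mean squared gap as a consistency measure are all hypothetical choices that the abstract does not specify.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def finite_difference(seq: List[Point], dt: float) -> List[Point]:
    """Difference consecutive 2D points and divide by the time step dt."""
    return [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(seq, seq[1:])
    ]


def pseudo_labels(
    positions: List[Point], dt: float = 0.4
) -> Tuple[List[Point], List[Point]]:
    """Derive pseudo velocity and pseudo acceleration from positions.

    dt = 0.4 s is an assumed sampling interval, not a value from the text.
    A sequence of T positions yields T-1 velocities and T-2 accelerations.
    """
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    return velocity, acceleration


def mean_squared_gap(a: List[Point], b: List[Point]) -> float:
    """Mean squared distance between two motion sequences.

    Assumed stand-in for a consistency measure between a directly
    predicted quantity and its pseudo label; sequences are compared
    over their common length.
    """
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    return sum(
        (ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b)
    ) / n


if __name__ == "__main__":
    # Hypothetical predicted positions of a pedestrian speeding up along x.
    predicted = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
    pseudo_v, pseudo_a = pseudo_labels(predicted, dt=1.0)
    print(pseudo_v, pseudo_a)
```

In a setup like this, the velocity stream's direct output could be compared against `pseudo_v` using `mean_squared_gap`, so that the supervision signal comes from the model's own position prediction and not from an additional ground-truth label.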
