Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction

Yizhou Huang, Yihua Cheng, Kezhi Wang

TL;DR

This work tackles the challenge of predicting pedestrian trajectories under long-tail behavior by moving beyond fully supervised learning to a self-supervised framework that explicitly models position, velocity, and acceleration. It introduces a three-stream encoder–decoder network with hierarchical feature fusion, where velocity informs position and acceleration informs velocity, and employs a motion-consistency mechanism to derive pseudo-labels from predicted motion. The model optimizes multiple losses, including $\mathcal{L}_{\text{pos}}$, $\mathcal{L}_{\text{va}}$, $\mathcal{L}_{\text{cons1}}$, and $\mathcal{L}_{\text{cons2}}$, to align predictions across the three motion states and selects the most coherent velocity/acceleration pair to supervise pseudo-values via cross-entropy. Across ETH-UCY and SDD benchmarks, the approach achieves state-of-the-art results, demonstrates strong ablations, and shows robustness to different prediction horizons and hypothesis counts, with practical implications for safer, more reliable autonomous systems.

Abstract

Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where predicted trajectories are optimized directly against ground-truth labels. This amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supervised pedestrian trajectory prediction framework that explicitly models position, velocity, and acceleration. We leverage velocity and acceleration information to enhance position prediction through feature injection and a self-supervised motion consistency mechanism. Our model hierarchically injects velocity features into the position stream and acceleration features into the velocity stream, which enables it to predict position, velocity, and acceleration jointly. From the predicted position, we compute the corresponding pseudo velocity and acceleration, allowing the model to learn from data-generated pseudo labels and thus achieve self-supervised learning. We further design a motion consistency evaluation strategy grounded in physical principles; it selects the most reasonable predicted motion trend by comparing it with historical dynamics and uses this trend to guide and constrain trajectory generation. We conduct experiments on the ETH-UCY and Stanford Drone datasets, demonstrating that our method achieves state-of-the-art performance on both datasets.
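
The pseudo-label idea in the abstract can be illustrated with a minimal sketch. It assumes a unit time step, simple forward differences, and a mean Euclidean gap as the discrepancy measure; the paper's actual differencing scheme and loss definitions are not given in this summary, and the names `finite_difference` and `mean_l2` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def finite_difference(seq: List[Point], dt: float = 1.0) -> List[Point]:
    """Forward difference (seq[t+1] - seq[t]) / dt; the result is one step shorter."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(seq, seq[1:])]


def mean_l2(a: List[Point], b: List[Point]) -> float:
    """Mean Euclidean distance between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


# Predicted future positions for one pedestrian (illustrative values only).
pred_pos = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]

# Pseudo labels derived from the predicted positions themselves.
pseudo_vel = finite_difference(pred_pos)    # velocity implied by the positions
pseudo_acc = finite_difference(pseudo_vel)  # acceleration implied by the positions

# Output of a separate velocity stream (illustrative values only).
pred_vel = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5)]

# An assumed consistency penalty: how far the velocity stream is from the
# velocity implied by the position stream.
consistency_gap = mean_l2(pred_vel, pseudo_vel)
print(pseudo_vel, pseudo_acc, consistency_gap)
```

Because the pseudo velocity and acceleration come from the model's own position output, a penalty of this kind needs no additional ground-truth labels, which is the sense in which the abstract calls the scheme self-supervised.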

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Our method takes historical pedestrian trajectories as inputs. From the historical trajectory, we extract position, velocity, and acceleration (accel.) information, which are then fed into a three-stream network. In the figure, the letter $d$ represents the amount of change. The network hierarchically fuses features across the three streams and enforces motion consistency among position, velocity, and acceleration predictions during training. The output consists of multiple possible future position trajectories.
  • Figure 2: Our method takes position, velocity, and acceleration as inputs to a three-stream network. This network utilizes a transformer encoder to capture temporal information and outputs the features of position, velocity, and acceleration for each pedestrian. Velocity features are hierarchically injected to aid position prediction, while acceleration features are incorporated to support velocity prediction. A social decoder is then employed to predict $K$ potential trajectories based on these features. The decoder applies an attention mechanism to neighboring features, capturing interactions between pedestrians. Finally, we propose a self-supervised strategy to ensure motion consistency among position, velocity, and acceleration predictions. We group the velocity and acceleration predictions and apply $\mathcal{L}_{\text{cons1}}$ to ensure consistency within each group. Two heuristic strategies are defined to evaluate the velocity and acceleration predictions, with learnable weights assigned to these evaluations. Based on these evaluations, the network selects a velocity-acceleration group and applies $\mathcal{L}_{\text{cons2}}$ to enforce motion consistency between the position and the selected group.
  • Figure 3: Visualizations of position and velocity trajectories reveal that the predicted trajectory distribution struggles to accurately account for sudden changes in a pedestrian's direction when motion consistency is not applied (w/o motion consistency). In contrast, our method considers the pedestrian's acceleration and deceleration patterns, enabling more precise trajectory predictions aligned with the direction of the historical trajectory.
  • Figure 4: The left figure shows predictions based solely on position information while the person changes movement direction. The ground truth is highlighted in green. The middle figure shows the predictions of our method, where speed changes are visualized as fast and slow. In this case, the person adjusts their movement direction with a slow-down signal. The right figure illustrates a scenario where the person aims to maintain their movement direction, ensuring that the speed does not decrease. This figure shows the advantage of using position, velocity and acceleration information.
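
The group selection step described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the two heuristics and their learnable weights are not specified in this summary, so the sketch substitutes two assumed scores (agreement between a group's velocity and acceleration, and continuity with the last observed velocity) with fixed weights. All function names and parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Group = Tuple[Sequence[Vec], Sequence[Vec]]  # (velocity sequence, acceleration sequence)


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1])


def norm(a: Vec) -> float:
    return math.hypot(a[0], a[1])


def group_score(hist_vel: Vec, vel: Sequence[Vec], acc: Sequence[Vec],
                w_internal: float = 1.0, w_history: float = 1.0) -> float:
    """Assumed score for one velocity-acceleration group; lower is better.

    internal: acc[t] should match vel[t+1] - vel[t] (unit time step assumed).
    history:  the first predicted velocity should stay close to the last
              observed velocity.
    The fixed weights stand in for the learnable weights in the caption.
    """
    steps = list(zip(vel, vel[1:], acc))
    internal = sum(norm(sub(sub(v1, v0), a)) for v0, v1, a in steps) / max(len(steps), 1)
    history = norm(sub(vel[0], hist_vel))
    return w_internal * internal + w_history * history


def select_group(hist_vel: Vec, groups: List[Group]) -> int:
    """Return the index of the group with the lowest assumed score."""
    scores = [group_score(hist_vel, v, a) for v, a in groups]
    return min(range(len(scores)), key=scores.__getitem__)


# Last observed velocity and two candidate groups (illustrative values only).
hist_vel = (1.0, 0.0)
groups = [
    ([(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]),  # steady walk
    ([(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)], [(0.0, 0.0), (0.0, 0.0)]),  # abrupt turn, acc disagrees
]
best = select_group(hist_vel, groups)
print(best)
```

In the caption's terms, the selected group would then serve as the target for the consistency term between the position prediction and the chosen velocity-acceleration pair.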