LG-Traj: LLM Guided Pedestrian Trajectory Prediction
Pranav Singh Chib, Pravendra Singh
TL;DR
LG-Traj is a framework that uses Large Language Models to extract past motion cues from observed pedestrian trajectories and complements them with future motion cues learned by fitting a Gaussian mixture model to the future trajectories of the training data. A rank-$k$ singular value decomposition augmentation, a transformer-based motion encoder, and a social decoder jointly model motion patterns and social interactions to predict multiple plausible futures with associated probabilities. The approach achieves state-of-the-art performance on the ETH-UCY and Stanford Drone Dataset (SDD) benchmarks, and extensive ablations show that motion cues, positional encoding, and trajectory augmentation each play a critical role. Because every predicted trajectory carries an associated probability, LG-Traj conveys the uncertainty of its forecasts along with the forecasts themselves.
Abstract
Accurate pedestrian trajectory prediction is crucial for various applications, and it requires a deep understanding of pedestrian motion patterns in dynamic environments. However, existing pedestrian trajectory prediction methods do not yet fully exploit these motion patterns. This paper investigates whether Large Language Models (LLMs) can improve pedestrian trajectory prediction by inducing motion cues. We introduce LG-Traj, a novel approach that uses LLMs to generate the motion cues present in pedestrians' past (observed) trajectories. Our approach also incorporates the motion cues present in pedestrians' future trajectories by clustering the future trajectories of the training data using a mixture of Gaussians. These motion cues, along with pedestrian coordinates, facilitate a better understanding of the underlying representation. Furthermore, we use singular value decomposition to augment the observed trajectories and incorporate the augmented trajectories into model learning to further enhance representation learning. Our method employs a transformer-based architecture comprising a motion encoder, which models motion patterns, and a social decoder, which captures social interactions among pedestrians. We demonstrate the effectiveness of our approach on popular pedestrian trajectory prediction benchmarks, namely ETH-UCY and SDD, and present various ablation experiments to validate our approach.
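
To make the singular value decomposition augmentation more concrete, the following is a minimal sketch of a rank-$k$ approximation of observed trajectories. It is not the authors' implementation. The matrix layout (one flattened trajectory per row), the function name `rank_k_approximation`, the choice `k=3`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rank_k_approximation(trajs: np.ndarray, k: int) -> np.ndarray:
    """Return a rank-k approximation of a batch of trajectories.

    trajs has shape (num_pedestrians, num_steps, 2). Flattening each
    trajectory into one row is an assumed layout, not taken from the paper.
    """
    n, t, d = trajs.shape
    flat = trajs.reshape(n, t * d)

    # Thin SVD: flat = U @ diag(S) @ Vt, singular values in descending order.
    u, s, vt = np.linalg.svd(flat, full_matrices=False)

    # Keep only the k largest singular values and their singular vectors.
    k = min(k, s.shape[0])
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return approx.reshape(n, t, d)


# Synthetic stand-in data (hypothetical): 5 pedestrians, 8 observed steps.
rng = np.random.default_rng(0)
steps = rng.normal(loc=0.3, scale=0.05, size=(5, 8, 2))
observed = np.cumsum(steps, axis=1)

# The low-rank copy keeps the dominant motion and drops fine detail; it could
# be used alongside the original trajectories as an augmented view.
augmented = rank_k_approximation(observed, k=3)
```

A smaller `k` discards more of the fine-grained variation, while setting `k` to the full rank of the matrix reproduces the input exactly, so `k` controls how strongly the augmented trajectories differ from the originals.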
