LG-Traj: LLM Guided Pedestrian Trajectory Prediction

Pranav Singh Chib, Pravendra Singh

TL;DR

LG-Traj presents a novel framework that leverages Large Language Models to extract past motion cues from observed pedestrian trajectories and complements them with future motion cues learned through a Gaussian mixture model of future trajectories. A rank-$k$ singular value decomposition augmentation, a transformer-based motion encoder, and a social decoder jointly model motion patterns and social interactions to predict multiple plausible futures with associated probabilities. The approach achieves state-of-the-art performance on ETH-UCY and Stanford Drone Dataset (SDD) benchmarks, supported by extensive ablations showing the critical roles of motion cues, positional encoding, and trajectory augmentation. By delivering both trajectory predictions and their uncertainty, LG-Traj offers a principled and practical advancement for robust pedestrian trajectory forecasting in dynamic environments.

Abstract

Accurate pedestrian trajectory prediction is crucial for various applications, and it requires a deep understanding of pedestrian motion patterns in dynamic environments. However, existing pedestrian trajectory prediction methods still need more exploration to fully leverage these motion patterns. This paper investigates the possibilities of using Large Language Models (LLMs) to improve pedestrian trajectory prediction tasks by inducing motion cues. We introduce LG-Traj, a novel approach incorporating LLMs to generate motion cues present in pedestrian past/observed trajectories. Our approach also incorporates motion cues present in pedestrian future trajectories by clustering future trajectories of training data using a mixture of Gaussians. These motion cues, along with pedestrian coordinates, facilitate a better understanding of the underlying representation. Furthermore, we utilize singular value decomposition to augment the observed trajectories, incorporating them into the model learning process to further enhance representation learning. Our method employs a transformer-based architecture comprising a motion encoder to model motion patterns and a social decoder to capture social interactions among pedestrians. We demonstrate the effectiveness of our approach on popular pedestrian trajectory prediction benchmarks, namely ETH-UCY and SDD, and present various ablation experiments to validate our approach.
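
The singular value decomposition augmentation mentioned above can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that the observed trajectories of one scene are stacked into a matrix with one row per pedestrian and the flattened $(x, y)$ coordinates as columns; the abstract does not state the exact matrix layout or the value of $k$, so both are hypothetical here, as are the function name `rank_k_approximation` and the synthetic data.

```python
import numpy as np


def rank_k_approximation(matrix: np.ndarray, k: int) -> np.ndarray:
    """Return the best rank-k approximation of a 2-D matrix (Frobenius norm)."""
    if matrix.ndim != 2:
        raise ValueError("expected a 2-D matrix")
    if not 1 <= k <= min(matrix.shape):
        raise ValueError("k must lie between 1 and min(matrix.shape)")
    # Thin SVD: matrix = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


# Synthetic scene (an assumption, not data from the paper):
# 5 pedestrians observed for 8 time steps, walking roughly linearly.
rng = np.random.default_rng(0)
num_pedestrians, obs_len = 5, 8
start = rng.uniform(0.0, 10.0, size=(num_pedestrians, 1, 2))
velocity = rng.normal(0.0, 0.5, size=(num_pedestrians, 1, 2))
steps = np.arange(obs_len).reshape(1, obs_len, 1)
noise = rng.normal(0.0, 0.05, size=(num_pedestrians, obs_len, 2))
observed = start + velocity * steps + noise  # shape (5, 8, 2)

# Assumed layout: one row per pedestrian, flattened coordinates as columns.
flat = observed.reshape(num_pedestrians, obs_len * 2)
augmented = rank_k_approximation(flat, k=3).reshape(observed.shape)
```

Smaller values of `k` discard more of the fine-grained variation in the trajectories, while `k` equal to the full rank reproduces the input exactly, so `k` controls how strongly the augmented trajectories deviate from the originals.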

Paper Structure (30 sections, 20 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Overview of the proposed LG-Traj, which takes multiple inputs: past motion cues, the past observed trajectory, and future motion cues. First, we augment the given observed trajectory using a rank-$k$ approximation via singular value decomposition (SVD). For the subsequent steps, we use either the original or the augmented past observed trajectory. Next, we generate past motion cues ($M_i$) from the LLM using the past observed trajectory ($X_i$) of the $i^{th}$ pedestrian. The tokenizer output ($T_i$) is generated from $M_i$ by the tokenizer. The past motion cues embedding ($Z_m$) is obtained by a linear transformation of $T_i$. The past trajectory embedding ($Z_p$) is obtained by a linear transformation of $X_i$. The cluster embedding ($Z_c$) is obtained by a linear transformation of the trajectory clusters, which are generated by clustering the future trajectories of the training data using a mixture of Gaussians. Positional encoding is added to the concatenated embeddings ($Z_m, Z_p, Z_c$), and the result is passed to the motion encoder to model the motion patterns. The embedding generated by the motion encoder ($Z_e$), along with the neighbour embedding ($Z_{ne}$), is passed to the social decoder to predict future trajectories.
  • Figure 2: Illustration of the mixture of Gaussians, where each Gaussian represents a diverse cluster of trajectories.
  • Figure 3: Illustration of input prompt and examples of motion cues generation from the LLM. We present three different examples where the LLM correctly identifies the underlying trajectory motion pattern, such as linear motion, curved motion, and standing still, based on the coordinates provided as input to the LLM.
  • Figure 4: Illustration of predicted trajectories from ETH (first column), UNIV (second column), HOTEL (third column), and ZARA (fourth column) datasets. Predicted pedestrian trajectories are highlighted in yellow. The observed trajectories are indicated in orange, while the ground truth trajectories are depicted in green. Our method demonstrates the prediction of future trajectories (yellow), closely matching the ground truth trajectories.
  • Figure 5: Visualization of augmented trajectories for three pedestrians sampled from SDD using different $k$ values in rank-$k$ approximation.
  • ...and 1 more figure
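
The embedding-and-concatenation step described in the Figure 1 caption can be sketched at the level of array shapes. This is only an illustration under several assumptions that the caption does not confirm: the embeddings are concatenated along the token axis, the positional encoding is the standard sinusoidal one, the linear layers have no bias, and all sizes (`d_model`, `num_tokens`, `num_clusters`, and so on) are hypothetical. Random weights and random inputs stand in for the learned layers, the tokenizer output, and the trajectory clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to make the shapes visible.
d_model = 16       # shared embedding width
num_tokens = 6     # length of the tokenizer output T_i
obs_len = 8        # observed time steps in X_i
pred_len = 12      # future time steps covered by each cluster
num_clusters = 4   # number of trajectory clusters


def linear(x: np.ndarray, out_dim: int) -> np.ndarray:
    """Bias-free linear map with random weights (stand-in for a learned layer)."""
    weight = rng.normal(0.0, 0.1, size=(x.shape[-1], out_dim))
    return x @ weight


def sinusoidal_positional_encoding(length: int, dim: int) -> np.ndarray:
    """Standard sinusoidal encoding (an assumption; dim must be even)."""
    positions = np.arange(length)[:, None]
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    encoding = np.zeros((length, dim))
    encoding[:, 0::2] = np.sin(positions * freqs)
    encoding[:, 1::2] = np.cos(positions * freqs)
    return encoding


# Stand-in inputs (random, not real data).
token_ids = rng.integers(0, 1000, size=(num_tokens, 1)).astype(float)  # T_i
past_xy = rng.normal(size=(obs_len, 2))                                # X_i
clusters = rng.normal(size=(num_clusters, pred_len * 2))               # trajectory clusters

z_m = linear(token_ids, d_model)  # past motion cues embedding
z_p = linear(past_xy, d_model)    # past trajectory embedding
z_c = linear(clusters, d_model)   # cluster embedding

# Assumed concatenation along the token axis, then add positional encoding.
tokens = np.concatenate([z_m, z_p, z_c], axis=0)
encoder_input = tokens + sinusoidal_positional_encoding(len(tokens), d_model)
```

In this sketch `encoder_input` plays the role of the sequence passed to the motion encoder; the encoder itself and the social decoder are omitted because the captions do not specify their internals.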