GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought

Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee

TL;DR

GUIDE-CoT addresses the challenge of predicting full pedestrian trajectories by integrating a goal-oriented visual prompt with a chain-of-thought-inspired LLM. The approach uses a visual prompt and a pretrained visual encoder to produce accurate goal cues, then feeds structured reasoning prompts into an LLM to generate trajectories toward those goals, with an added user-guidance mechanism for directional or group-based adjustments. The two components, a visual-prompt goal predictor and a CoT LLM trajectory generator, are trained separately. The method achieves state-of-the-art results on ETH/UCY and supports controllable trajectory generation. This multimodal framework enhances the interpretability and adaptability of pedestrian trajectory prediction in dynamic urban environments, and the code is publicly available for replication.

Abstract

While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy by combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at https://github.com/ai-kmu/GUIDE-CoT.

Paper Structure

This paper contains 16 sections, 7 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between conventional LLM-based methods and our approach for pedestrian trajectory prediction. (a) Conventional approaches often leverage an LLM's reasoning capability to predict pedestrians' future trajectories conditioned on textual contexts, which contain their past observations and scene descriptions (from an off-the-shelf image captioning model). (b) Our proposed method, called GUIDE-CoT, further improves the model's prediction performance by predicting pedestrians' final goal position given the scene image with overlaid visual prompts (i.e., a red arrow). The predicted goal position is then supplied to the LLM as additional context, in a manner similar to Chain-of-Thought (CoT), offering rich intermediate reasoning context.
  • Figure 2: An overview of our proposed approach, called GUIDE-CoT. (a) Our model first predicts each pedestrian's final goal position given (i) the pedestrians' past observations, (ii) a semantic BEV map, and (iii) a top-down view scene image with a visual prompt (i.e., a red arrow). Our model generates a sentence describing their final positions, such as "Pedestrian 0 will arrive at coordinate (57, 95) after the next 12 frames." (b) The generated goal description is then added to the LLM's input, in a manner similar to Chain-of-Thought reasoning, and the LLM generates the final trajectory of each pedestrian.
  • Figure 3: Our goal-conditioned model further allows the user to provide text-driven guidance, controlling the model's trajectory prediction process. Such guidance may contain (b) a directional guide (e.g., "make the pedestrian walk to the right") or (c) a positional guide (e.g., "make the pedestrian join a group with the neighbor"). Compare results with and without user-provided guidance, i.e., (a) vs. (b) and (c).
  • Figure 4: Example variants of the visual prompts we use, with different colors (i.e., red, blue, and green) and shapes (i.e., arrows and points).
  • Figure 5: Visualization of controllable trajectory generation based on user guidance. The red points represent the pedestrian’s observed trajectory, and the stars indicate the generated goals according to the goal-oriented visual prompt. Points in other colors show predicted trajectories from goal adjustments, illustrating variations like stopping, turning, and grouping with nearby pedestrians. These trajectories demonstrate the model’s adaptability to user-guided trajectory generation.
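
The Figure 2 caption describes a goal sentence that is passed to the LLM as intermediate context before the full trajectory is generated. The sketch below illustrates how such a prompt could be assembled as plain text. Only the goal sentence template is taken from the caption; the function names (`goal_sentence`, `build_cot_prompt`), the wording of the observation and instruction lines, and the sample coordinates are assumptions made for illustration and are not taken from the authors' code.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def goal_sentence(ped_id: int, goal: Point, horizon: int = 12) -> str:
    """Format a predicted goal using the sentence quoted in the Figure 2 caption."""
    x, y = goal
    return (
        f"Pedestrian {ped_id} will arrive at coordinate ({x}, {y}) "
        f"after the next {horizon} frames."
    )


def build_cot_prompt(
    ped_id: int, observed: List[Point], goal: Point, horizon: int = 12
) -> str:
    """Assemble a three-line prompt: observations, goal sentence, instruction.

    The observation and instruction wording is hypothetical; only the
    goal sentence follows the caption.
    """
    history = ", ".join(f"({x}, {y})" for x, y in observed)
    lines = [
        f"Pedestrian {ped_id} was observed at coordinates: {history}.",
        goal_sentence(ped_id, goal, horizon),  # intermediate reasoning step
        f"Predict the coordinates of pedestrian {ped_id} "
        f"for the next {horizon} frames.",
    ]
    return "\n".join(lines)


# Hypothetical observed positions; the goal matches the caption's example.
prompt = build_cot_prompt(0, [(50, 90), (52, 92), (54, 93)], (57, 95))
print(prompt)
```

Keeping the goal sentence as a separate line makes it straightforward to swap in a user-adjusted goal, which is one plausible way to realize the guidance shown in Figures 3 and 5.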