AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

Teng Wang; Yanting Lu; Ruize Wang

Abstract

We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in human-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation is a novel trajectory tokenization scheme: point tokens serve as categorical and positional markers for waypoints, while the numerical values of the waypoints are encoded as corresponding point embeddings and integrated into the LLM's embedding space through a lightweight encoder-decoder architecture. This design preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitating the modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Trained with a two-stage strategy, AutoTraces achieves state-of-the-art forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.
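
The tokenization idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `PointCodec`, `toy_backbone`, `forecast`, the layer sizes, and the random weights are all hypothetical names and values chosen for illustration, and the averaging "backbone" merely stands in for the LLM. The sketch only shows the data flow: each waypoint contributes a `<point>` marker plus an embedding of its coordinates, and each predicted waypoint is decoded from the latest state and fed back in as the next input.

```python
import math
import random

POINT_TOKEN = "<point>"  # marker token; the coordinates travel as an embedding


def _matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows (illustrative only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def _matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


class PointCodec:
    """Hypothetical lightweight encoder/decoder between (x, y) and a dim-sized embedding."""

    def __init__(self, dim=16, hidden=32, seed=0):
        rng = random.Random(seed)
        self.w_in = _matrix(2, hidden, rng)
        self.w_out = _matrix(hidden, dim, rng)
        self.w_dec = _matrix(dim, 2, rng)

    def encode(self, xy):
        # Two-layer MLP: coordinates -> hidden -> embedding.
        hidden = [math.tanh(v) for v in _matvec(list(xy), self.w_in)]
        return _matvec(hidden, self.w_out)

    def decode(self, state):
        # Linear head: backbone state -> (x, y).
        return tuple(_matvec(state, self.w_dec))


def toy_backbone(embeddings):
    """Stand-in for the LLM: averages all embeddings seen so far (an assumption)."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def forecast(history, codec, steps):
    """Autoregressively roll out `steps` waypoints after the observed `history`."""
    tokens = [POINT_TOKEN] * len(history)
    embeddings = [codec.encode(p) for p in history]
    predictions = []
    for _ in range(steps):
        state = toy_backbone(embeddings)
        waypoint = codec.decode(state)
        predictions.append(waypoint)
        # Feed the prediction back in as the next <point> input.
        tokens.append(POINT_TOKEN)
        embeddings.append(codec.encode(waypoint))
    return tokens, predictions


if __name__ == "__main__":
    observed = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.2)]
    tokens, future = forecast(observed, PointCodec(), steps=4)
    print(len(tokens), len(future))
```

Because the rollout is a plain loop over generated points, the horizon is set by `steps` alone, which mirrors the flexible-length forecasting the abstract mentions; with untrained weights the predicted coordinates themselves are meaningless.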

Paper Structure (15 sections, 4 equations, 6 figures, 3 tables)

Introduction
Related Work
Method
Overall Framework
Trajectory Tokenization
Chain-of-Thought Reasoning with VLMs
Training and Inference
Experiments
Experiment Settings
Main Results
Cross-Scene Trajectory Prediction
Extending to Long-term Trajectory Prediction
Ablation Study
Visualizations of Predicted Trajectory
Conclusion

Figures (6)

Figure 1: Comparison between AutoTraces and previous LLM-based solutions. Our method introduces point tokens and embeddings for waypoint representation, enabling autoregressive prediction.
Figure 2: The overall framework of AutoTraces, built upon LLaVA-Video [Zhang2024lLLaVaVideo] for socially compliant trajectory forecasting. In the first stage, the model is pre-trained on video-text pairs with reasoning prompts to acquire thinking knowledge. The second stage fine-tunes the model using trajectory waypoints (via a Point Encoder), aligning visual observations (via a Vision Encoder) and goal locations (via a Point Encoder). During inference, the model generates future trajectory waypoints as <point> tokens, enabling autoregressive predictions.
Figure 3: Illustration of the CoT generation for AutoTraces, incorporating visual observations and trajectory analysis. Red points and lines denote the historical trajectory, while blue points and lines denote the ground-truth trajectory. Action annotations (R: right, L: left, S: straight) are marked along the trajectory.
Figure 4: Ablation study of our learnable waypoint tokenization scheme and CoT reasoning on the SCAND [Scan2022] dataset.
Figure 5: Visualizations of trajectories predicted by AutoTraces and other baselines on the SCAND [Scan2022] dataset.
...and 1 more figure
