Traj-LLM: A New Exploration for Empowering Trajectory Prediction with Pre-trained Large Language Models

Zhengxing Lan, Hongbo Li, Lingshan Liu, Bo Fan, Yisheng Lv, Yilong Ren, Zhiyong Cui

TL;DR

This work investigates leveraging pre-trained Large Language Models (LLMs) for autonomous-vehicle trajectory prediction without explicit prompt engineering. By introducing sparse context joint encoding, a lane-aware Mamba module, and a multi-modal Laplace decoder, Traj-LLM enables LLMs to capture high-level scene knowledge and interactions for multi-trajectory forecasting. The approach achieves state-of-the-art results on nuScenes, with strong few-shot performance and efficient inference, while ablations confirm the importance of both the LLM-based high-level modeling and lane-focused guidance. Overall, Traj-LLM presents a universal, adaptable framework that expands the role of LLMs in motion forecasting beyond prompting, enabling robust, multi-modal predictions in complex driving scenes.

Abstract

Predicting the future trajectories of dynamic traffic actors is a cornerstone task in autonomous driving. Though notable existing efforts have yielded impressive performance improvements, a gap persists in scene cognition and in understanding complex traffic semantics. This paper proposes Traj-LLM, the first work to investigate the potential of using Large Language Models (LLMs) without explicit prompt engineering to generate future motion from agents' past/observed trajectories and scene semantics. Traj-LLM starts with sparse context joint coding to dissect the agent and scene features into a form that LLMs understand. On this basis, we explore LLMs' powerful comprehension abilities to capture a spectrum of high-level scene knowledge and interactive information. To emulate the human-like lane-focus cognitive function and enhance Traj-LLM's scene comprehension, we introduce lane-aware probabilistic learning powered by the Mamba module. Finally, a multi-modal Laplace decoder is designed to achieve scene-compliant multi-modal predictions. Extensive experiments show that Traj-LLM, fortified by LLMs' strong prior knowledge and understanding, together with lane-aware probabilistic learning, outperforms state-of-the-art methods across evaluation metrics. Moreover, the few-shot analysis further substantiates Traj-LLM's performance: with just 50% of the dataset, it outperforms the majority of benchmarks that rely on the complete data. This study explores equipping the trajectory prediction task with the advanced capabilities inherent in LLMs, furnishing a more universal and adaptable solution for forecasting agent motion.
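
The abstract names a multi-modal Laplace decoder but this excerpt does not give its loss. As a rough illustration only, the sketch below shows one common way such a decoder is trained: a winner-takes-all Laplace negative log-likelihood plus a classification term on the mode scores. The function names, tensor shapes, and the winner-takes-all selection are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np

def laplace_nll(y, mu, b):
    """Elementwise negative log-likelihood of y under Laplace(mu, b), b > 0."""
    return np.log(2.0 * b) + np.abs(y - mu) / b

def winner_takes_all_loss(y, mu, b, logits):
    """Hypothetical training loss for a K-mode Laplace decoder.

    y:      (T, 2)    ground-truth future positions
    mu:     (K, T, 2) predicted locations for each of K modes
    b:      (K, T, 2) predicted positive scales for each mode
    logits: (K,)      unnormalized mode scores
    """
    # Pick the mode whose trajectory is closest to the ground truth on average.
    dist = np.linalg.norm(mu - y[None], axis=-1).mean(axis=-1)  # (K,)
    best = int(np.argmin(dist))
    # Regression term: Laplace NLL of the ground truth under the best mode only.
    reg = laplace_nll(y, mu[best], b[best]).mean()
    # Classification term: cross-entropy pushing probability onto the best mode.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cls = -log_probs[best]
    return reg + cls, best

# Toy check: two modes, the first matches the ground truth exactly.
y = np.zeros((3, 2))
mu = np.stack([np.zeros((3, 2)), np.ones((3, 2))])
b = np.ones((2, 3, 2))
logits = np.zeros(2)
loss, best = winner_takes_all_loss(y, mu, b, logits)
```

Only the best mode receives the regression gradient in this scheme, which is one way to keep the K predictions diverse rather than collapsing to a single average trajectory.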

Paper Structure

This paper contains 19 sections, 16 equations, 9 figures, 7 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Framework of Traj-LLM.
  • Figure 2: The overview of Pre-trained LLMs.
  • Figure 3: The proposed Mamba layer for lane-aware probability learning.
  • Figure 4: Comparison of Traj-LLM with baseline models across three key metrics: trainable parameters, inference speed, and $\text{MR}_5$. The size of the circles in the figure corresponds to the number of trainable parameters in each model.
  • Figure 5: Comparison of Traj-LLM with baseline models across three key metrics: trainable parameters, inference speed, and $\text{MR}_{10}$. The size of the circles in the figure corresponds to the number of trainable parameters in each model.
  • ...and 4 more figures
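
The captions for Figures 4 and 5 refer to $\text{MR}_5$ and $\text{MR}_{10}$ without defining them in this excerpt. The sketch below assumes the usual nuScenes-style reading of miss rate over the top $k$ modes: an agent counts as a miss when every one of its $k$ most probable predictions deviates from the ground truth by more than a threshold at some time step. The function name, array shapes, and the 2 m default threshold are assumptions for illustration.

```python
import numpy as np

def miss_rate_top_k(preds, probs, gt, k, threshold=2.0):
    """Fraction of agents for which all top-k predictions miss.

    preds:     (N, M, T, 2) M candidate trajectories for each of N agents
    probs:     (N, M)       mode probabilities (higher is more likely)
    gt:        (N, T, 2)    ground-truth future trajectories
    threshold: assumed miss threshold in meters
    """
    # Indices of the k most probable modes per agent.
    top = np.argsort(-probs, axis=1)[:, :k]                               # (N, k)
    top_preds = np.take_along_axis(preds, top[:, :, None, None], axis=1)  # (N, k, T, 2)
    # Maximum pointwise L2 distance to the ground truth for each kept mode.
    dist = np.linalg.norm(top_preds - gt[:, None], axis=-1)               # (N, k, T)
    max_dist = dist.max(axis=-1)                                          # (N, k)
    # An agent is a miss only if none of its k modes stays within the threshold.
    missed = (max_dist > threshold).all(axis=1)
    return float(missed.mean())

# Toy data: 2 agents, 2 modes each, 2 time steps.
gt = np.zeros((2, 2, 2))
preds = np.zeros((2, 2, 2, 2))
preds[0, 1] = 5.0   # agent 0: the less likely mode is far off
preds[1, 0] = 5.0   # agent 1: the most likely mode is far off
probs = np.array([[0.9, 0.1], [0.6, 0.4]])
mr_1 = miss_rate_top_k(preds, probs, gt, k=1)
mr_2 = miss_rate_top_k(preds, probs, gt, k=2)
```

In the toy data, keeping only the single most probable mode misses agent 1, while keeping both modes recovers a hit for every agent, which is why miss rate typically falls as $k$ grows.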