iMotion-LLM: Instruction-Conditioned Trajectory Generation

Abdulwahab Felemban, Nussair Hroub, Jian Ding, Eslam Abdelrahman, Xiaoqian Shen, Abduallah Mohamed, Mohamed Elhoseiny

TL;DR

iMotion-LLM presents a novel framework that integrates a large language model with trajectory prediction modules to enable instruction-conditioned trajectory generation for autonomous driving. By introducing two datasets (InstructWaymo and Open-Vocabulary InstructNuPlan) and the Instruction Following Recall (IFR) metric, the approach rigorously evaluates instruction adherence alongside trajectory quality and safety. The method maps scene features into the LLM input space and uses an LLM-grounded conditioning pipeline to produce interpretable execution plans and safety justifications, achieving strong IFR and safety performance while enabling text-guided scenario generation. Ablation studies, comparisons with language-conditioned baselines, and closed-loop, safety-focused evaluations underscore the framework’s potential for offline safety testing, simulation, and robust reasoning about driving behavior under natural language instructions.

Abstract

We introduce iMotion-LLM, a large language model (LLM) integrated with trajectory prediction modules for interactive motion generation. Unlike conventional approaches, it generates feasible, safety-aligned trajectories based on textual instructions, enabling adaptable and context-aware driving behavior. It combines an encoder-decoder multimodal trajectory prediction model with a pre-trained LLM fine-tuned using LoRA, projecting scene features into the LLM input space and mapping special tokens to a trajectory decoder for text-based interaction and interpretable driving. To support this framework, we introduce two datasets: 1) InstructWaymo, an extension of the Waymo Open Motion Dataset with direction-based motion instructions, and 2) Open-Vocabulary InstructNuPlan, which features safety-aligned instruction-caption pairs and corresponding safe trajectory scenarios. Our experiments validate that instruction conditioning enables trajectory generation that follows the intended condition. iMotion-LLM demonstrates strong contextual comprehension, achieving 84% average accuracy in direction feasibility detection and 96% average accuracy in safety evaluation of open-vocabulary instructions. This work lays the foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety alignment testing for trajectory generation models. Our code, pre-trained model, and datasets are available at: https://vision-cair.github.io/iMotion-LLM/.

Paper Structure

This paper contains 35 sections, 1 equation, 11 figures, 13 tables, and 2 algorithms.

Figures (11)

  • Figure 1: iMotion-LLM generates feasible, safety-aligned trajectories from human instructions. It uses two datasets, InstructWaymo (direction-based) and Open-Vocabulary InstructNuPlan (safety-focused), to support instruction-conditioned generation and justification.
  • Figure 2: Open-Vocabulary InstructNuPlan pipeline for generating safe/unsafe instruction-caption pairs using GPT-4o mini, guided by scenario type, motion behavior, and safety metadata.
  • Figure 3: Given a textual instruction and scene context embeddings, iMotion-LLM employs an LLM Projection module to map the encoded scene token embeddings from the Scene Encoder into the LLM input space. The LLM then generates an updated ego-vehicle token, an instruction token, and caption text tokens. The instruction token is projected into a query $Q'$, while the ego-vehicle token is projected to form the key and value representing the ego-vehicle embedding, which is subsequently used by the Multimodal Trajectory Decoder. Newly added components are highlighted in orange.
  • Figure 4: iMotion-LLM Qualitative Results. Qualitative results showcasing three unsafe instructions (left) and three safe instructions (right). The results demonstrate iMotion-LLM's capability to generate relevant trajectories, assess the safety of given instructions, and provide reasoning for its decisions.
  • Figure 5: Qualitative comparison of three challenging InstructWaymo scenarios. Non-conditional GameFormer in red, C-GameFormer in blue, and iMotion-LLM in green.
  • ...and 6 more figures
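
The data flow described in the Figure 3 caption can be illustrated with a shape-level sketch. This is not the authors' implementation: the dimensions, weight names (`W_in`, `W_q`, `W_k`, `W_v`), and the `stub_llm` and `condition_decoder_inputs` helpers are all hypothetical, the LLM is replaced by a trivial placeholder, and the Multimodal Trajectory Decoder is omitted. It only shows how scene tokens might be projected into the LLM input space, and how the instruction token and ego-vehicle token returned by the LLM might be projected into a query, key, and value for the decoder.

```python
import random

def linear(weights, vector):
    """Apply a bias-free linear map; weights is (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def random_matrix(out_dim, in_dim, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

# All sizes are hypothetical, chosen only to keep the example small.
SCENE_DIM, LLM_DIM, DECODER_DIM = 8, 16, 8

rng = random.Random(0)
W_in = random_matrix(LLM_DIM, SCENE_DIM, rng)   # LLM Projection: scene -> LLM input space
W_q = random_matrix(DECODER_DIM, LLM_DIM, rng)  # instruction token -> query Q'
W_k = random_matrix(DECODER_DIM, LLM_DIM, rng)  # ego-vehicle token -> key
W_v = random_matrix(DECODER_DIM, LLM_DIM, rng)  # ego-vehicle token -> value

def stub_llm(llm_inputs):
    """Placeholder for the LLM (assumption, not the real model).

    Reuses the first input as the ego-vehicle state and the mean of all
    inputs as the instruction state, purely to produce two vectors.
    """
    ego_state = list(llm_inputs[0])
    count = len(llm_inputs)
    instruction_state = [sum(column) / count for column in zip(*llm_inputs)]
    return ego_state, instruction_state

def condition_decoder_inputs(scene_tokens):
    """Project scene tokens in, run the stub LLM, project its two tokens out."""
    llm_inputs = [linear(W_in, token) for token in scene_tokens]
    ego_state, instruction_state = stub_llm(llm_inputs)
    return {
        "query": linear(W_q, instruction_state),
        "key": linear(W_k, ego_state),
        "value": linear(W_v, ego_state),
    }

# Five made-up scene tokens from a hypothetical Scene Encoder.
scene_tokens = [[rng.uniform(-1, 1) for _ in range(SCENE_DIM)] for _ in range(5)]
decoder_inputs = condition_decoder_inputs(scene_tokens)
print({name: len(vector) for name, vector in decoder_inputs.items()})
# prints {'query': 8, 'key': 8, 'value': 8}
```

In this sketch the three output vectors share the decoder width so that a downstream attention layer could consume them directly; how the actual decoder combines them is specified in the paper, not here.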