LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning

Yudong Liu, Spencer Hallyburton, Jiwoo Kim, Yueqian Lin, Yiming Li, Qinsi Wang, Hui Ye, Jingwei Sun, Miroslav Pajic, Yiran Chen, Hai Li

TL;DR

This work tackles the fragility of end-to-end autonomous driving planners by introducing LLaViDA, a vision–language driving assistant that reasons about scenes to generate precise trajectories in a single turn. It leverages a Vision–Language Model to forecast multi-agent motion, ground scene semantics, infer ego intent, derive a meta-action, and output a numerically precise low-risk trajectory, trained via supervised fine-tuning followed by Trajectory Preference Optimization. A new NuScenes-TP dataset provides ground-truth meta-actions and GPT-4o-generated reasoning traces to supervise language-conditioned planning. On NuScenes, LLaViDA achieves state-of-the-art open-loop trajectory planning with lower average L2 error and fewer collisions, and the approach shows backbone-agnostic performance with efficiency optimizations for real-time deployment.

Abstract

Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning in autonomous driving. A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), enhances scene understanding and trajectory planning by injecting regression-based supervision, yielding a powerful "VLM Trajectory Planner for Autonomous Driving." On the NuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end planners and other recent VLM/LLM-based baselines on the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the NuScenes test set. The code for this paper is available on GitHub.
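
For readers unfamiliar with the headline metric, the sketch below shows one common reading of an "average L2 trajectory error": the mean Euclidean distance, in meters, between predicted and ground-truth ego waypoints at matching timesteps. It is an illustration under stated assumptions, not the paper's evaluation code. The function name `average_l2_error` and the sample waypoints are hypothetical. Open-loop protocols on NuScenes also differ in how they average over prediction horizons, which this sketch does not model.

```python
import math


def average_l2_error(predicted, ground_truth):
    """Mean Euclidean distance (meters) between matching (x, y) waypoints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and the same length")
    # Per-waypoint displacement between prediction and ground truth.
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


# Hypothetical waypoints for illustration only (not data from the paper).
predicted = [(0.0, 1.0), (0.0, 2.0), (3.0, 4.0)]
ground_truth = [(0.0, 1.0), (0.0, 2.5), (0.0, 0.0)]
error = average_l2_error(predicted, ground_truth)
```

A lower value means the planned path stays closer to the recorded human driving trajectory. The metric says nothing about safety on its own, which is why the abstract reports it alongside a collision rate.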

Paper Structure

This paper contains 36 sections, 14 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Three paradigms for tackling the trajectory planning task in end-to-end autonomous driving.
  • Figure 2: Construction pipeline of the proposed NuScenes-TP dataset. Starting from the raw NuScenes data, we extract ego and object states, derive their corresponding future trajectories, and further compute ego meta-actions from the ego trajectory. In parallel, GPT-4o is used to generate reasoning annotations, which are then validated against the ground-truth meta-actions.
  • Figure 3: Overview of the proposed LLaViDA framework. LLaViDA models trajectory planning as a multi-object motion-prediction problem. By explicitly predicting the motion of key objects in the scene (yellow and purple traces) and imitating human driver reasoning, it generates an accurate ego trajectory (green) in a causally grounded manner.
  • Figure 4: Representative qualitative results. All trajectories are overlaid on the same front-view camera images: red indicates the ground-truth trajectory, green represents the prediction from our method, and cyan denotes the baseline prediction. Text boxes contain the corresponding textual output from our pipeline.
  • Figure 5: Visualizations sampled from the NuScenes test split. The ground-truth trajectory is shown in red and the predicted trajectory in green.