LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning

Yudong Liu; Spencer Hallyburton; Jiwoo Kim; Yueqian Lin; Yiming Li; Qinsi Wang; Hui Ye; Jingwei Sun; Miroslav Pajic; Yiran Chen; Hai Li

TL;DR

This work tackles the fragility of end-to-end autonomous driving planners by introducing LLaViDA, a vision–language driving assistant that reasons about a scene and generates a precise trajectory in a single turn. It uses a Vision–Language Model to forecast multi-agent motion, ground scene semantics, infer ego intent, derive a meta-action, and output a numerically precise low-risk trajectory. The model is trained via supervised fine-tuning followed by Trajectory Preference Optimization. A new NuScenes-TP dataset provides ground-truth meta-actions and GPT-4o-generated reasoning traces to supervise language-conditioned planning. On NuScenes, LLaViDA achieves state-of-the-art open-loop trajectory planning, with lower average L2 error and fewer collisions than prior end-to-end and VLM/LLM-based baselines. The approach is backbone-agnostic and includes efficiency optimizations for real-time deployment.

Abstract

Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning in autonomous driving. A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), enhances scene understanding and trajectory planning by injecting regression-based supervision, and produces a powerful "VLM Trajectory Planner for Autonomous Driving." On the NuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines on the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the NuScenes test set. The code for this paper is available on GitHub.
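To make the reported metric concrete, the following is a minimal sketch of how an average L2 trajectory error can be computed: the mean Euclidean distance, in metres, between predicted and ground-truth waypoints. The function name `average_l2_error` and the sample waypoints are hypothetical, and the exact averaging protocol used in the paper's NuScenes evaluation (horizons, per-timestep versus cumulative averaging) is not stated here, so this should be read as an illustration rather than the authors' evaluation code.

```python
import math


def average_l2_error(predicted, ground_truth):
    """Mean Euclidean distance (metres) between matched 2D waypoints.

    Both arguments are sequences of (x, y) pairs sampled at the same
    timesteps. This is an illustrative helper, not the paper's code.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


# Hypothetical waypoints (metres, ego frame); not data from the paper.
predicted = [(0.0, 1.0), (0.0, 2.0), (0.3, 3.0)]
ground_truth = [(0.0, 1.0), (0.0, 2.4), (0.0, 3.4)]

error = average_l2_error(predicted, ground_truth)
print(f"average L2 error: {error:.2f} m")
```

In this toy case the per-waypoint distances are 0.0 m, 0.4 m, and 0.5 m, so the average is about 0.3 m. Collision rate is a separate metric: it needs the geometry of surrounding objects in addition to the waypoints, so it is not covered by this sketch.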
