Unifying Language-Action Understanding and Generation for Autonomous Driving

Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen

TL;DR

This paper introduces LinkVLA, an architecture that unifies language and action tokens in a shared discrete codebook processed by a single multi-modal model, establishing a structural link between the two modalities. It also adds an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping.

Abstract

Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine (C2F) generation method that efficiently decodes the action sequence, reducing inference time by 86%. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
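One way to picture the shared discrete codebook is to append a block of action token ids after the text vocabulary, so that a single model reads and emits both kinds of token. The sketch below is illustrative only: the vocabulary size, grid resolution, coordinate ranges, and all function names are assumptions made for this example, and the uniform grid is just one possible quantization, not necessarily the one used in the paper.

```python
# Hypothetical constants: none of these values come from the paper.
TEXT_VOCAB_SIZE = 1000       # stand-in for the size of the LLM's text vocabulary
GRID_BINS = 64               # bins per axis of the (assumed) uniform action grid
X_RANGE = (-32.0, 32.0)      # lateral range in metres (assumed)
Y_RANGE = (0.0, 64.0)        # longitudinal range in metres (assumed)
NUM_ACTION_TOKENS = GRID_BINS * GRID_BINS


def _bin(value, lo, hi, bins):
    """Clamp value to [lo, hi] and map it to an integer bin in [0, bins - 1]."""
    clamped = min(max(value, lo), hi)
    index = int((clamped - lo) / (hi - lo) * bins)
    return min(index, bins - 1)


def waypoint_to_token(x, y):
    """Quantize a 2D waypoint and offset it past the text ids."""
    ix = _bin(x, X_RANGE[0], X_RANGE[1], GRID_BINS)
    iy = _bin(y, Y_RANGE[0], Y_RANGE[1], GRID_BINS)
    return TEXT_VOCAB_SIZE + iy * GRID_BINS + ix


def is_action_token(token):
    """True if the id falls in the action block of the shared vocabulary."""
    return TEXT_VOCAB_SIZE <= token < TEXT_VOCAB_SIZE + NUM_ACTION_TOKENS


def token_to_waypoint(token):
    """Map an action token id back to the centre of its grid cell."""
    if not is_action_token(token):
        raise ValueError(f"{token} is not an action token")
    iy, ix = divmod(token - TEXT_VOCAB_SIZE, GRID_BINS)
    cell_w = (X_RANGE[1] - X_RANGE[0]) / GRID_BINS
    cell_h = (Y_RANGE[1] - Y_RANGE[0]) / GRID_BINS
    return (X_RANGE[0] + (ix + 0.5) * cell_w, Y_RANGE[0] + (iy + 0.5) * cell_h)
```

With these assumed settings each cell is 1 m wide, so decoding a token recovers the original waypoint to within half a cell per axis. Text ids and action ids never collide, which is what lets one output head cover both modalities.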

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: LinkVLA achieves both higher performance and lower latency in closed-loop evaluation, and offers superior instruction-following capability.
  • Figure 2: An overview of the LinkVLA architecture. The model comprises a pretrained InternViT [chen2024internvl] visual backbone and a Qwen2-0.5B [team2024qwen2] LLM. At its core, LinkVLA unifies language tokens and action tokens (for navigation points and trajectories) into a single, shared codebook. Training is driven by a unified objective for both language-action understanding and generation, ensuring deep semantic alignment. Inference employs an efficient coarse-to-fine process: (a) the model first predicts the trajectory endpoint, which is then (b) interpolated into coarse waypoints and finally (c) refined into the final, smooth trajectory.
  • Figure 3: Illustration of the action understanding (Left) and the action generation (Right).
  • Figure 4: Visualization in a challenging environment under various language instructions. The generated trajectory accurately adheres to the language instruction while remaining safe and feasible within the complex environment.
  • Figure S1: Comparison of uniform and log token grids, with the corresponding waypoint distributions under each grid.
  • ...and 1 more figure
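The coarse-to-fine inference described in the Figure 2 caption (predict the endpoint, interpolate coarse waypoints, refine) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the caption does not say which interpolation is used, so straight-line interpolation from the ego origin is assumed here, and `predict_endpoint` and `refine` are hypothetical callables standing in for the model's two decoding steps.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def interpolate_coarse(endpoint: Point, num_waypoints: int) -> List[Point]:
    """Evenly space waypoints on the straight line from the ego origin (0, 0),
    which is excluded, to the endpoint, which is included. Linear interpolation
    is an assumption made for this sketch."""
    if num_waypoints < 1:
        raise ValueError("num_waypoints must be at least 1")
    ex, ey = endpoint
    return [
        (ex * k / num_waypoints, ey * k / num_waypoints)
        for k in range(1, num_waypoints + 1)
    ]


def coarse_to_fine(
    predict_endpoint: Callable[[], Point],
    refine: Callable[[List[Point]], List[Point]],
    num_waypoints: int,
) -> List[Point]:
    """Two model calls in total, regardless of trajectory length."""
    endpoint = predict_endpoint()                         # (a) predict the endpoint
    coarse = interpolate_coarse(endpoint, num_waypoints)  # (b) coarse waypoints
    return refine(coarse)                                 # (c) refine all waypoints
```

For a quick check, `coarse_to_fine(lambda: (4.0, 8.0), lambda c: c, 4)` with an identity refiner returns four points ending at `(4.0, 8.0)`. The point of the structure is that the number of model calls stays fixed at two, whereas step-by-step auto-regressive decoding would need one call per action token.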