ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving

Yunsheng Ma, Burhaneddin Yaman, Xin Ye, Mahmut Yurt, Jingru Luo, Abhirup Mallik, Ziran Wang, Liu Ren

TL;DR

ALN-P3 is a training-only cross-modal co-distillation framework that unifies language reasoning with perception, prediction, and planning in autonomous driving. Three alignment modules, Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A), tie BEV visual tokens to language outputs across the P3 (perception, prediction, planning) stack while keeping inference cost unchanged. Across nuScenes, Nu-X, TOD3Cap, and nuScenes-QA, ALN-P3 reports state-of-the-art results in planning safety and multimodal reasoning, including substantial gains in collision reduction and explanation quality. By aligning intermediate driving representations with language, the method improves interpretability and grounding without sacrificing real-time performance, which is a practical benefit for deployable autonomous driving systems.

Abstract

Recent advances have explored integrating large language models (LLMs) into end-to-end autonomous driving systems to enhance generalization and interpretability. However, most existing approaches are limited to either driving performance or vision-language reasoning, making it difficult to achieve both simultaneously. In this paper, we propose ALN-P3, a unified co-distillation framework that introduces cross-modal alignment between "fast" vision-based autonomous driving systems and "slow" language-driven reasoning modules. ALN-P3 incorporates three novel alignment mechanisms: Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A), which explicitly align visual tokens with corresponding linguistic outputs across the full perception, prediction, and planning stack. All alignment modules are applied only during training and incur no additional costs during inference. Extensive experiments on four challenging benchmarks (nuScenes, Nu-X, TOD3Cap, and nuScenes-QA) demonstrate that ALN-P3 significantly improves both driving decisions and language reasoning, achieving state-of-the-art results.

Paper Structure

This paper contains 24 sections, 8 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of the proposed ALN-P3 framework, which integrates vision and language alignment across the full autonomous driving stack. The architecture includes three alignment modules: Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A). These modules align BEV-based visual tokens, such as instance, motion, and ego features, with corresponding natural language outputs through cross-modal prompts and projection heads. All alignments are applied only during training and introduce no additional computation at inference time, enabling efficient and interpretable reasoning for perception, prediction, and planning tasks.
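
The training-only nature of the alignment described in the Figure 1 caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the cosine-distance loss, the mean pooling, the single `align_weight` coefficient, the function names, and the one-to-one pairing of instance, motion, and ego tokens with P1A, P2A, and P3A. The page does not state the loss form the authors actually use.

```python
import math

# Hypothetical stage names, loosely following the token types in the caption.
STAGES = ("instance", "motion", "ego")


def project(tokens, weight):
    """Apply a linear projection head to each token.

    tokens: list of N vectors of length d_in; weight: d_in x d_out matrix.
    """
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def cosine_alignment_loss(a, b, eps=1e-8):
    """Return 1 - cosine similarity (assumed loss; near 0 when a and b align)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def total_loss(driving_loss, visual_tokens, text_embeddings, heads,
               training, align_weight=1.0):
    """Add alignment terms to the driving loss during training only."""
    if not training:
        # Inference path: projection heads and text embeddings are never touched.
        return driving_loss
    align = 0.0
    for stage in STAGES:
        pooled_visual = mean_pool(project(visual_tokens[stage], heads[stage]))
        pooled_text = mean_pool(text_embeddings[stage])
        align += cosine_alignment_loss(pooled_visual, pooled_text)
    return driving_loss + align_weight * align
```

Because the projection heads appear only inside the training branch, dropping them at deployment leaves the driving stack's forward pass unchanged, which is the property the caption emphasizes.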