DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

Muxi Diao; Lele Yang; Hongbo Yin; Zhexu Wang; Yejie Wang; Daxin Tian; Kongming Liang; Zhanyu Ma

TL;DR

This work introduces AutoDriveRL, a reinforcement-learning framework that decomposes autonomous driving into four vision-language QA tasks—Perception, Prediction, Planning, and Behavior—and trains a unified model, DriveRX, with task-specific rewards to enable coherent cross-task reasoning. DriveRX serves as a high-level semantic backbone that produces interpretable, stage-wise reasoning traces while improving robustness under challenging and corrupted conditions. The authors also demonstrate downstream benefits by distilling DriveRX reasoning into DriveRX-Agent for trajectory prediction and DriveRX-VLA for action-level control, achieving competitive results in open- and closed-loop evaluations. The framework achieves state-of-the-art behavior reasoning on DriveBench and shows strong generalization, offering a promising direction for robust, interpretable planning and control in autonomous driving. The authors release AutoDriveRL and DriveRX resources to foster further research.

Abstract

Effective autonomous driving hinges on robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for multi-stage decision-making. DriveRX achieves strong performance on the public DriveBench benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. DriveRX serves as a high-level semantic reasoning backbone, producing structured stage-wise reasoning chains that enhance decision consistency. These outputs also provide high-quality supervisory signals for annotation and downstream planning/control models. We release the AutoDriveRL framework and DriveRX to support future research.
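
The idea of treating each of the four tasks as a QA problem with its own reward signal can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `QASample`, `REWARD_FNS`, and `score`, and the two toy string-matching rewards, are hypothetical stand-ins and not the reward models used in AutoDriveRL, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The four task names come from the abstract; everything else is hypothetical.
TASKS = ("perception", "prediction", "planning", "behavior")


@dataclass(frozen=True)
class QASample:
    """One vision-language QA item (the image input is omitted in this sketch)."""
    task: str
    question: str
    reference: str


def exact_match_reward(answer: str, reference: str) -> float:
    """Toy reward: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def token_overlap_reward(answer: str, reference: str) -> float:
    """Toy reward: Jaccard overlap between the two token sets."""
    a = set(answer.lower().split())
    r = set(reference.lower().split())
    union = a | r
    return len(a & r) / len(union) if union else 0.0


# Assumed mapping from task to reward function (placeholder choices only).
REWARD_FNS: Dict[str, Callable[[str, str], float]] = {
    "perception": token_overlap_reward,
    "prediction": token_overlap_reward,
    "planning": token_overlap_reward,
    "behavior": exact_match_reward,
}


def score(sample: QASample, answer: str) -> float:
    """Dispatch a model answer to the reward function for the sample's task."""
    if sample.task not in REWARD_FNS:
        raise ValueError(f"unknown task: {sample.task!r}")
    return REWARD_FNS[sample.task](answer, sample.reference)
```

In a reinforcement-learning loop, a per-task score of this kind would serve as the scalar reward for a sampled answer. A learned or rule-based reward model could replace either toy function without changing the dispatch structure.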
