DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

Muxi Diao, Lele Yang, Hongbo Yin, Zhexu Wang, Yejie Wang, Daxin Tian, Kongming Liang, Zhanyu Ma

TL;DR

This work introduces AutoDriveRL, a reinforcement-learning framework that decomposes autonomous driving into four vision-language QA tasks—Perception, Prediction, Planning, and Behavior—and trains a unified model, DriveRX, with task-specific rewards to enable coherent cross-task reasoning. DriveRX serves as a high-level semantic backbone that produces interpretable, stage-wise reasoning traces while improving robustness under challenging and corrupted conditions. The authors also demonstrate downstream benefits by distilling DriveRX reasoning into DriveRX-Agent for trajectory prediction and DriveRX-VLA for action-level control, achieving competitive results in open- and closed-loop evaluations. The framework achieves state-of-the-art behavior reasoning on DriveBench and shows strong generalization, offering a promising direction for robust, interpretable planning and control in autonomous driving. The authors release AutoDriveRL and DriveRX resources to foster further research.

Abstract

Effective autonomous driving hinges on robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for multi-stage decision-making. DriveRX achieves strong performance on the public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. DriveRX serves as a high-level semantic reasoning backbone, producing structured stage-wise reasoning chains that enhance decision consistency. These outputs also provide high-quality supervisory signals for annotation and downstream planning/control models. We release the AutoDriveRL framework and DriveRX to support future research.
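
The abstract describes each of the four tasks as a vision-language QA problem scored by its own reward model. As a rough illustration of that routing idea only, the sketch below dispatches a model response to a per-task reward function. Everything in it is an assumption for illustration: the names `QASample`, `REWARD_FNS`, and `score` are hypothetical, and the two toy reward functions (exact match and token overlap) are stand-ins, not the reward models used in AutoDriveRL.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The four task names come from the abstract; all else here is hypothetical.
TASKS = ("perception", "prediction", "planning", "behavior")


@dataclass
class QASample:
    task: str        # one of TASKS
    question: str    # text of the driving question
    reference: str   # ground-truth answer text


def exact_match_reward(response: str, reference: str) -> float:
    """Placeholder reward: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0


def token_overlap_reward(response: str, reference: str) -> float:
    """Placeholder reward: fraction of reference tokens found in the response."""
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    resp_tokens = set(response.lower().split())
    return len(ref_tokens & resp_tokens) / len(ref_tokens)


# Assumed mapping from task to reward function (illustrative, not the paper's).
REWARD_FNS: Dict[str, Callable[[str, str], float]] = {
    "perception": exact_match_reward,
    "prediction": exact_match_reward,
    "planning": token_overlap_reward,
    "behavior": exact_match_reward,
}


def score(sample: QASample, response: str) -> float:
    """Route a response to the reward function registered for the sample's task."""
    if sample.task not in REWARD_FNS:
        raise ValueError(f"unknown task: {sample.task!r}")
    return REWARD_FNS[sample.task](response, sample.reference)
```

In an RL loop, a per-sample score of this kind would serve as the scalar reward for the sampled response. How AutoDriveRL actually computes and combines its rewards is not specified in this summary.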

Paper Structure

This paper contains 42 sections, 3 equations, 24 figures, and 16 tables.

Figures (24)

  • Figure 1: Overview of the AutoDriveRL framework. To tackle the challenge of cross-task reasoning in autonomous driving, AutoDriveRL decomposes complex scenarios into four core tasks—Perception, Prediction, Planning and Behavior—each formulated in a VQA style. These tasks form a structured reasoning chain and are jointly optimized via RL. During training, we design task-specific reward models for each task to provide fine-grained feedback. The resulting model, DriveRX, demonstrates strong generalization and robustness under challenging driving conditions.
  • Figure 2: Training curves of DriveRX over reinforcement learning steps.
  • Figure 3: Behavior task example comparing Align-DSV and DriveRX. The goal is to determine the ego vehicle’s behavior at an intersection. Align-DSV incorrectly outputs "going straight", while DriveRX correctly concludes "steering to the right" by performing structured reasoning. The behavior task is decomposed into four subtasks: Perception, Prediction, Planning, and final Behavior decision, allowing DriveRX to make more accurate and interpretable decisions.
  • Figure 4: The model score distribution based on DriveLM-Hard.
  • Figure 5: The proportion of each task in the dataset.
  • ...and 19 more figures