OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic

Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang, Chen Lv

TL;DR

OpenREAD tackles open-ended reasoning in end-to-end autonomous driving by combining reinforcement fine-tuning with an LLM-as-Critic. It constructs Chain-of-Thought driving knowledge data and jointly optimizes knowledge reasoning and trajectory planning with GRPO, guided by Qwen3-LLM rewards and semantic similarity cues. The approach improves both knowledge evaluation and planning accuracy on NuScenes/OmniDrive-based tasks and shows data-efficient gains through reinforced exploration. The work also notes substantial computational costs and identifies scaling knowledge data for broader generalization as future work.

Abstract

Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based AD framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
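
The abstract does not spell out how critic scores become a training signal. As a minimal sketch, assuming the standard GRPO formulation in which each sampled response's reward is normalized against the other responses drawn for the same prompt, the advantage computation might look like the following. The function name, the `eps` constant, and the example scores are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for responses sampled from the SAME prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every response in the group gets the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical critic scores in [0, 1] for four sampled answers to one
# open-ended driving question; real scores would come from the LLM critic.
critic_scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(critic_scores)
```

Responses scored above the group mean receive positive advantages and those below receive negative ones; when all scores tie, every advantage is zero and the group contributes no learning signal.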

Paper Structure

This paper contains 26 sections, 8 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Comparison between RFT and SFT with increased driving-related knowledge. Traj. denotes the data of trajectory planning only, and C.T. denotes the data of counterfactual trajectory analysis. Extending driving knowledge through RFT leads to a notable improvement on the NuScenes dataset compared with SFT.
  • Figure 2: The training pipeline of OpenREAD. In the cold-start stage, we use the CoT-annotated data for SFT, followed by RFT with GRPO to further enhance reasoning capabilities.
  • Figure 3: An overview of the prompt templates used for CoT annotation generation (left) and Qwen3-LLM open-ended driving knowledge evaluation (right).
  • Figure 4: An overview of our CoT annotations for driving-related knowledge and trajectory planning.
  • Figure 5: OpenREAD plans less conservatively when entering the intersection.
  • ...and 6 more figures