
Making Large Language Models Better Planners with Reasoning-Decision Alignment

Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, Xiaodan Liang

TL;DR

This work tackles misalignment between CoT reasoning and decisions in large language model–based autonomous driving by introducing RDA-Driver, a multimodal LLM that jointly performs reasoning and trajectory planning under a reasoning-decision alignment constraint. It combines a BEV-based vision encoder with a region-based adapter to convert visual information into language-friendly tokens, and it uses redesigned multi-turn CoTs covering perception, prediction, decision-making, and planning. The model is trained with vanilla CoT fine-tuning plus a contrastive alignment loss that enforces consistency between the CoT outputs and the final plan, using both model-generated and data-generated samples. On nuScenes and DriveLM-nuScenes, RDA-Driver achieves state-of-the-art planning performance with competitive L2 errors and low collision rates, while also delivering improved interpretability and robustness through aligned reasoning and decision outcomes.

Abstract

Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but suffer from dataset bias and a lack of interpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, this popular strategy suffers from misalignment between the crafted CoTs and the consequent decision-making, a problem that previous LLM-based AD methods have left untouched. To address this problem, we propose an end-to-end decision-making model based on a multimodality-augmented LLM, which simultaneously executes CoT reasoning and produces planning results. Furthermore, we propose a reasoning-decision alignment constraint between the paired CoTs and planning results, enforcing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs to enable the model to comprehend complex scenarios and enhance decision-making performance. We dub our proposed large language planner with reasoning-decision alignment RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with 0.80 L2 error and 0.32 collision rate, and also achieves leading results on the challenging DriveLM-nuScenes benchmark with 0.82 L2 error and 0.38 collision rate.

Paper Structure (19 sections, 6 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: Motivation of RDA-Driver. (a) visualizes the distribution of the CoT score (higher is better) and the decision error of the predicted trajectory (lower is better) for LLaVA [liu2024visual], indicating the misalignment between the CoT reasoning and the planning results. (b) shows an example of inconsistency between the CoT reasoning and the consequent decision. Although the model correctly reasons about the status of the current scene, i.e., it notices the front car and determines that it is not moving, the decision-making process wrongly plans for the ego vehicle to move forward.
  • Figure 2: Framework of RDA-Driver. RDA-Driver takes the multi-view images, ego status, and multi-turn CoT prompt as input, and simultaneously produces CoT reasoning and planning results. We construct multiple misaligned reasoning-decision samples from both the vanilla fine-tuned model and similar scenarios. During training, we compute the token-average score as a measure of the quality of CoT answers. We use the proposed contrastive loss to ensure that the scores of positive samples are higher than those of the generated negative samples.
  • Figure 3: Illustration of the CoT prompt.
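
The Figure 2 caption describes scoring each CoT answer by a token-average score and training so that positive samples score higher than generated negatives. The sketch below illustrates that idea under stated assumptions: the score is taken to be the mean token log-probability, and the contrastive term is written as a margin-based hinge loss. Neither the exact loss form nor the `margin` value is given in this summary, and the function names `token_average_score` and `contrastive_ranking_loss` are hypothetical, so this should be read as one plausible instantiation rather than the authors' implementation.

```python
import math
from typing import Sequence


def token_average_score(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability over the tokens of one CoT answer (assumed score)."""
    if not token_logprobs:
        raise ValueError("answer must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def contrastive_ranking_loss(
    pos_score: float, neg_scores: Sequence[float], margin: float = 0.1
) -> float:
    """Hinge penalty, averaged over negatives (assumed loss form).

    A negative contributes a penalty whenever its score is not at least
    `margin` below the positive score.
    """
    if not neg_scores:
        return 0.0
    penalties = [max(0.0, margin - (pos_score - s)) for s in neg_scores]
    return sum(penalties) / len(penalties)


# Toy example: per-token probabilities for one aligned (positive) answer
# and two misaligned (negative) answers. The numbers are made up.
positive = token_average_score([math.log(p) for p in (0.9, 0.8, 0.85)])
negatives = [
    token_average_score([math.log(p) for p in (0.9, 0.2, 0.3)]),
    token_average_score([math.log(p) for p in (0.5, 0.4, 0.6)]),
]
loss = contrastive_ranking_loss(positive, negatives)
```

In this toy case both negatives already score well below the positive, so the hinge term is inactive and the loss is zero; the loss only becomes positive when a misaligned answer scores close to or above the aligned one.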