PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning

Yupeng Zheng, Zebin Xing, Qichao Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia, Kun Zhan, Xianpeng Lang, Yaran Chen, Dongbin Zhao

TL;DR

PlanAgent tackles the challenge of closed-loop vehicle motion planning by embedding a multi-modal large language model within a three-module agent: Environment Transformation, Reasoning Engine, and Reflection. It converts raw scene data into a BEV map and a lane-graph textual description, uses hierarchical chain-of-thought to generate IDM-based planner code, and validates plans via short-horizon simulations to curb MLLM uncertainty. The approach achieves state-of-the-art results on nuPlan Val14 and demonstrates strong generalization to Test14-hard, while also reducing token usage compared to text-only scene descriptions. These results suggest that integrating structured scene representations with hierarchical reasoning and safety-oriented reflection can yield robust, generalizable mid-to-mid planning in complex autonomous driving scenarios.
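The "IDM-based planner code" mentioned above refers to the Intelligent Driver Model, a car-following rule that computes longitudinal acceleration from the ego speed, the gap to the leading vehicle, and the closing speed. The sketch below shows the standard IDM rule only. The parameter names, default values, and the clamp on the dynamic term are illustrative assumptions, not values or code from the paper, and it does not show how PlanAgent's generated code chooses these parameters.

```python
import math
from dataclasses import dataclass


@dataclass
class IDMParams:
    # All defaults are illustrative assumptions, not values from the paper.
    desired_speed: float = 15.0  # v0, target free-road speed [m/s]
    time_headway: float = 1.5    # T, desired time gap to the leader [s]
    min_gap: float = 2.0         # s0, standstill distance [m]
    max_accel: float = 1.0       # a, maximum acceleration [m/s^2]
    comfort_decel: float = 2.0   # b, comfortable deceleration [m/s^2]
    delta: float = 4.0           # acceleration exponent


def idm_acceleration(v: float, gap: float, dv: float, p: IDMParams) -> float:
    """Standard IDM acceleration.

    v   : ego speed [m/s]
    gap : bumper-to-bumper distance to the leader [m] (math.inf if none)
    dv  : closing speed, ego speed minus leader speed [m/s]
    """
    # Desired dynamic gap s*. The max(0, ...) clamp is a common practical
    # guard against a negative dynamic term when the leader pulls away.
    dynamic = v * p.time_headway + v * dv / (2.0 * math.sqrt(p.max_accel * p.comfort_decel))
    s_star = p.min_gap + max(0.0, dynamic)
    free_term = (v / p.desired_speed) ** p.delta
    interaction_term = (s_star / gap) ** 2
    return p.max_accel * (1.0 - free_term - interaction_term)


params = IDMParams()
free_road_start = idm_acceleration(0.0, math.inf, 0.0, params)   # accelerates at max_accel
free_road_cruise = idm_acceleration(15.0, math.inf, 0.0, params)  # settles at desired speed
closing_fast = idm_acceleration(10.0, 5.0, 5.0, params)           # brakes behind a slow leader
```

Under this reading, a generated planner would differ from the baseline mainly in parameters such as the desired speed or time headway, which is one plausible way lateral and longitudinal instructions could be turned into code.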

Abstract

Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhile, learning-based methods have yet to outperform rule-based approaches in large-scale closed-loop scenarios. To address these issues, we propose PlanAgent, the first mid-to-mid planning system based on a Multi-modal Large Language Model (MLLM). The MLLM serves as a cognitive agent that introduces human-like knowledge, interpretability, and common-sense reasoning into closed-loop planning. Specifically, PlanAgent leverages the MLLM through three core modules. First, an Environment Transformation module constructs a Bird's Eye View (BEV) map and a lane-graph-based textual description from the environment as inputs. Second, a Reasoning Engine module introduces a hierarchical chain-of-thought from scene understanding to lateral and longitudinal motion instructions, culminating in planner code generation. Last, a Reflection module simulates and evaluates the generated planner to reduce the MLLM's uncertainty. PlanAgent inherits the common-sense reasoning and generalization capability of the MLLM, which enables it to handle both common and complex long-tailed scenarios. PlanAgent is evaluated on the large-scale and challenging nuPlan benchmarks. Comprehensive experiments demonstrate that PlanAgent outperforms the existing state-of-the-art in the closed-loop motion planning task. Code will be released soon.
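The interaction between the Reasoning Engine and the Reflection module can be pictured as a generate, simulate, then accept-or-rethink loop. The following is a minimal sketch under stated assumptions: the function names, the score threshold, the retry limit, the feedback string, and the fallback to the best-scoring candidate are all hypothetical and are not taken from the paper, which states only that the generated planner is simulated and evaluated before it is executed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    planner_code: str
    score: float


def plan_with_reflection(
    generate: Callable[[Optional[str]], str],
    simulate_and_score: Callable[[str], float],
    threshold: float = 0.8,   # hypothetical acceptance score
    max_rethinks: int = 3,    # hypothetical retry budget
) -> Candidate:
    """Ask the reasoning step for planner code until one candidate passes."""
    feedback: Optional[str] = None
    best: Optional[Candidate] = None
    for _ in range(max_rethinks + 1):
        code = generate(feedback)          # Reasoning Engine (stubbed here)
        score = simulate_and_score(code)   # Reflection: short simulation + score
        if best is None or score > best.score:
            best = Candidate(code, score)
        if score >= threshold:
            break                          # good enough: execute this planner
        # Otherwise rethink, passing the outcome back as feedback.
        feedback = f"Previous planner scored {score:.2f}; revise it."
    assert best is not None
    return best


# Usage with stubs standing in for the MLLM and the simulator.
stub_scores = {"cautious": 0.4, "balanced": 0.9}
attempts = iter(["cautious", "balanced", "aggressive"])
seen_feedback: List[Optional[str]] = []


def stub_generate(feedback: Optional[str]) -> str:
    seen_feedback.append(feedback)
    return next(attempts)


result = plan_with_reflection(stub_generate, lambda code: stub_scores[code])
```

In this stubbed run the first candidate falls below the threshold, so the loop requests a second one and stops there. Falling back to the best candidate when the budget runs out is a design choice of the sketch, so that a planner is always available to execute.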

Paper Structure (24 sections, 4 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Quantitative results of non-reactive closed-loop motion planning on the nuPlan [caesar2021nuplan] Val14 and Test14-hard benchmarks, compared with the state-of-the-art rule-based method PDM-Closed [dauner2023parting] and the learning-based methods PlanTF [cheng2023rethinking] and DTPP [huang2023differentiable]. Our proposed PlanAgent achieves state-of-the-art performance in common scenarios (Val14 benchmark) and demonstrates generalization in more challenging long-tailed scenarios (Test14-hard benchmark). Other methods either perform poorly in common scenarios or find it difficult to generalize to long-tailed scenarios. Please note that PDM-Closed, PlanTF, DTPP, and PlanAgent are denoted by purple, green, yellow, and orange, respectively. The best performances are represented in italics and underlined.
  • Figure 2: Based on an MLLM, we propose a novel planning agent pipeline comprising three modules: Environment Transformation, Reasoning Engine, and Reflection. In the Environment Transformation module, key information about the environment is extracted to form a BEV map and construct a lane-graph representation. Subsequently, the lane graph is translated into textual descriptions and used as scenario prompts along with the BEV map. In the Reasoning Engine module, an MLLM generates planner code based on the IDM [treiber2000congested] planner through hierarchical chain-of-thought reasoning with scenario prompts and pre-defined system prompts (including task definition prompts, common sense prompts, and chain-of-thought guidance prompts). In the Reflection module, the planner generated by the Reasoning Engine is simulated and evaluated. Whether to execute or rethink depends on the assessed score.
  • Figure 3: The top of the picture shows the process of constructing a lane graph (top right) from the environment (top left). The white square on the left represents the ego vehicle. The red node on the right indicates the centerline segment where the ego vehicle is located, while nodes of other colors correspond to lane segments of the same color on the left. The bottom of the picture displays the text description of the scenario converted from the lane graph, including node relationships and motion states.
  • Figure 4: A detailed example of the system prompt for PlanAgent. It consists of a task definition prompt, a common sense prompt, and a chain-of-thought prompt.
  • Figure 5: The comparison of the NR-CLS metric between our proposed PlanAgent and PDM-Closed [dauner2023parting] across 14 scenario types based on the nuPlan Test14-hard benchmark.
  • ...and 1 more figures
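
The lane-graph-to-text step described in the captions for Figures 2 and 3 can be illustrated with a small sketch. It assumes a simplified graph in which each node is a centerline segment with successor and neighbor links plus the vehicles on it. The class names, fields, and sentence templates are hypothetical and do not reproduce the paper's actual prompt format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    label: str
    speed_mps: float  # motion state, reduced here to a scalar speed


@dataclass
class LaneNode:
    node_id: str
    successors: List[str] = field(default_factory=list)
    left_neighbor: Optional[str] = None
    right_neighbor: Optional[str] = None
    agents: List[Agent] = field(default_factory=list)


def describe_lane_graph(nodes: List[LaneNode], ego_node_id: str) -> str:
    """Turn node relationships and agent motion states into plain sentences."""
    lines = [f"The ego vehicle is on node {ego_node_id}."]
    for node in nodes:
        relations = []
        if node.successors:
            relations.append("connects to " + ", ".join(node.successors))
        if node.left_neighbor:
            relations.append(f"has left neighbor {node.left_neighbor}")
        if node.right_neighbor:
            relations.append(f"has right neighbor {node.right_neighbor}")
        if relations:
            lines.append(f"Node {node.node_id} " + "; ".join(relations) + ".")
        for agent in node.agents:
            if agent.speed_mps < 0.1:
                state = "stopped"
            else:
                state = f"moving at {agent.speed_mps:.1f} m/s"
            lines.append(f"{agent.label} on node {node.node_id} is {state}.")
    return "\n".join(lines)


graph = [
    LaneNode("A", successors=["B"], left_neighbor="C",
             agents=[Agent("Vehicle 1", 0.0)]),
    LaneNode("B"),
    LaneNode("C", successors=["B"], agents=[Agent("Vehicle 2", 8.5)]),
]
text = describe_lane_graph(graph, ego_node_id="A")
```

Describing only graph nodes and the agents on them, rather than every map element, is consistent with the token savings the TL;DR reports relative to text-only scene descriptions, although the exact encoding used by PlanAgent is not given here.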