AD-H: Autonomous Driving with Hierarchical Agents

Zaibin Zhang, Shiyu Tang, Yuanhang Zhang, Talas Fu, Yifan Wang, Yang Liu, Dong Wang, Jing Shao, Lijun Wang, Huchuan Lu

TL;DR

AD-H introduces a hierarchical driving framework that separates high-level planning from low-level control by inserting mid-level language-driven commands between an MLLM planner and a lightweight controller. This decoupling lets the MLLM apply its emergent reasoning and world knowledge to planning, while the controller translates mid-level commands into actionable waypoints, improving generalization to long-horizon instructions and novel environments. A new LMDrive-H dataset with 1.753M hierarchically annotated frames is used to train the planner and controller. Closed-loop results in CARLA show that AD-H surpasses state-of-the-art end-to-end language-guided driving, demonstrates self-correction, and generalizes better, which points to practical benefits for scalable, instruction-grounded autonomous navigation.

Abstract

Due to the impressive capabilities of multimodal large language models (MLLMs), recent works have focused on employing MLLM-based agents for autonomous driving in large-scale and dynamic environments. However, prevalent approaches often directly translate high-level instructions into low-level vehicle control signals, which deviates from the inherent language generation paradigm of MLLMs and fails to fully harness their emergent powers. As a result, the generalizability of these methods is highly restricted by the autonomous driving datasets used during fine-tuning. To tackle this challenge, we propose to connect high-level instructions and low-level control signals with mid-level language-driven commands, which are more fine-grained than high-level instructions but more universal and explainable than control signals, and thus can effectively bridge the gap between them. We implement this idea through a hierarchical multi-agent driving system named AD-H, which includes an MLLM planner for high-level reasoning and a lightweight controller for low-level execution. The hierarchical design frees the MLLM from low-level control signal decoding and therefore fully releases its emergent capabilities in high-level perception, reasoning, and planning. We build a new dataset with action hierarchy annotations. Comprehensive closed-loop evaluations demonstrate several key advantages of our proposed AD-H system. First, AD-H notably outperforms state-of-the-art methods in driving performance, and even exhibits self-correction during vehicle operation, a scenario not encountered in the training dataset. Second, AD-H demonstrates superior generalization under long-horizon instructions and novel environmental conditions, significantly surpassing current state-of-the-art methods. We will make our data and code publicly accessible at https://github.com/zhangzaibin/AD-H
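
The three-level hierarchy described in the abstract (high-level instruction, mid-level command, low-level waypoints) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Observation`, `toy_planner`, `toy_controller`, `drive_step`, the command strings, and the waypoint values are all hypothetical placeholders. In particular, the rule-based planner stands in for the MLLM, and the lookup table stands in for the learned lightweight controller. The ego-frame convention (x lateral, y forward, metres) is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed convention: (x, y) in the ego frame, x lateral and y forward, in metres.
Waypoint = Tuple[float, float]


@dataclass
class Observation:
    """Hypothetical stand-in for the sensor input at one time step."""
    description: str


# Planner: (high-level instruction, observation) -> mid-level command text.
Planner = Callable[[str, Observation], str]
# Controller: (mid-level command, observation) -> low-level waypoints.
Controller = Callable[[str, Observation], List[Waypoint]]


def toy_planner(instruction: str, obs: Observation) -> str:
    """Rule-based placeholder for the MLLM planner (illustration only)."""
    if "off route" in obs.description:
        # Corrective command, loosely mirroring the self-correction idea.
        return "steer back toward the lane"
    if "intersection" in obs.description and "turn left" in instruction:
        return "turn left slowly"
    return "keep going straight"


def toy_controller(command: str, obs: Observation) -> List[Waypoint]:
    """Lookup-table placeholder for the lightweight controller."""
    templates = {
        "keep going straight": [(0.0, 2.0), (0.0, 4.0)],
        "turn left slowly": [(-0.5, 1.5), (-1.5, 2.5)],
        "steer back toward the lane": [(0.5, 1.5), (0.5, 3.0)],
    }
    return templates[command]


def drive_step(
    instruction: str,
    obs: Observation,
    planner: Planner,
    controller: Controller,
) -> Tuple[str, List[Waypoint]]:
    """One closed-loop step: plan a mid-level command, then decode waypoints."""
    command = planner(instruction, obs)
    waypoints = controller(command, obs)
    return command, waypoints


if __name__ == "__main__":
    instruction = "turn left at the next intersection"
    for text in ["open road", "approaching intersection", "vehicle is off route"]:
        command, waypoints = drive_step(
            instruction, Observation(text), toy_planner, toy_controller
        )
        print(f"{text!r}: {command!r} -> {waypoints}")
```

The point of the sketch is the interface: the planner only ever emits language, and the controller only ever consumes a short command plus the current observation, so either component could in principle be swapped without touching the other.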

Paper Structure (45 sections, 3 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Comparison between the previous method and AD-H in an oversteering scenario not encountered during training. The previous method keeps moving straight, deviating from the intended route. Conversely, the planner of AD-H can provide corrective mid-level commands to the controller, facilitating the vehicle's re-alignment.
  • Figure 2: (a) Pipeline of AD-H. The planner breaks down a high-level instruction into mid-level driving commands and the controller decodes low-level waypoints from the mid-level commands. (b) Examples of a high-level instruction, a mid-level command, and low-level waypoints.
  • Figure 3: Results in the self-correction scenario. (a) High-level instruction; (b) Visualization results of LMDrive; (c) Visualization results of AD-H; (d) Mid-level driving commands predicted by the planner of AD-H. LMDrive maintains a straight trajectory after oversteering and deviates from the intended path, whereas AD-H issues precise commands that guide the vehicle back on track.
  • Figure 4: Results with long-horizon instructions. (a) Long-horizon instruction; (b) LMDrive persists in following the initial instruction and continues forward; (c) AD-H assesses environmental cues to determine the appropriate time to turn; (d) Mid-level commands produced by AD-H.
  • Figure 5: Dataset Generation Pipeline.