CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems

Haichao Liu, Ruoyu Yao, Wenru Liu, Zhenmin Huang, Shaojie Shen, Jun Ma

TL;DR

CoDriveVLM addresses the integration gap in autonomous mobility-on-demand (AMoD) by jointly optimizing dispatching and cooperative motion planning in a high-fidelity urban setting. It uses Vision-Language Models (VLMs) to fuse multimodal context (BEV images and textual prompts) with graph-based planning and a few-shot memory for dynamic decision-making, and pairs this with a hybrid VLM-ADMM optimization that scales to large fleets of connected and autonomous vehicles (CAVs). The framework introduces a fast motion-planning loop, an adaptive dispatching trigger, and subgraph-based optimal control problems (OCPs) solved in parallel, which the authors report improves responsiveness and safety over baselines. Experiments in CARLA Town10 show robust performance across traffic conditions, and ablations underscore the value of BEV inputs and memory context for reliable inference and planning.

Abstract

The increasing demand for flexible and efficient urban transportation solutions has spotlighted the limitations of traditional Demand Responsive Transport (DRT) systems, particularly in accommodating diverse passenger needs and dynamic urban environments. Autonomous Mobility-on-Demand (AMoD) systems have emerged as a promising alternative, leveraging connected and autonomous vehicles (CAVs) to provide responsive and adaptable services. However, existing methods primarily focus on either vehicle scheduling or path planning, which often simplify complex urban layouts and neglect the necessity for simultaneous coordination and mutual avoidance among CAVs. This oversimplification poses significant challenges to the deployment of AMoD systems in real-world scenarios. To address these gaps, we propose CoDriveVLM, a novel framework that integrates high-fidelity simultaneous dispatching and cooperative motion planning for future AMoD systems. Our method harnesses Vision-Language Models (VLMs) to enhance multi-modality information processing, and this enables comprehensive dispatching and collision risk evaluation. The VLM-enhanced CAV dispatching coordinator is introduced to effectively manage complex and unforeseen AMoD conditions, thus supporting efficient scheduling decision-making. Furthermore, we propose a scalable decentralized cooperative motion planning method via consensus alternating direction method of multipliers (ADMM) focusing on collision risk evaluation and decentralized trajectory optimization. Simulation results demonstrate the feasibility and robustness of CoDriveVLM in various traffic conditions, showcasing its potential to significantly improve the fidelity and effectiveness of AMoD systems in future urban transportation networks. The code is available at https://github.com/henryhcliu/CoDriveVLM.git.
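
To make the consensus ADMM idea mentioned in the abstract concrete, here is a minimal sketch of the standard global-consensus ADMM iteration on a toy scalar problem. It is not the paper's trajectory optimizer: the quadratic local costs, the `targets` values, the penalty `rho`, and the function name `consensus_admm` are all assumptions chosen for illustration. Each agent minimizes its own cost while every local copy is driven to agree with a shared variable.

```python
from typing import List, Tuple


def consensus_admm(targets: List[float], rho: float = 1.0,
                   iterations: int = 100) -> Tuple[List[float], float]:
    """Toy global-consensus ADMM (illustrative assumption, not the paper's OCP).

    Agent i holds the local cost f_i(x) = 0.5 * (x - targets[i])**2.
    All local copies x_i must agree with the shared variable z.
    """
    n = len(targets)
    x = [0.0] * n   # local copies, one per agent
    u = [0.0] * n   # scaled dual variables, one per agent
    z = 0.0         # shared consensus variable

    for _ in range(iterations):
        # Local step (parallelizable): closed-form minimizer of
        # f_i(x) + (rho / 2) * (x - z + u_i)**2 for the quadratic cost above.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average of the local copies plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for ui, xi in zip(u, x)]

    return x, z


local_copies, consensus = consensus_admm([1.0, 2.0, 6.0])
print(local_copies, consensus)
```

For these quadratic costs the consensus value approaches the mean of the targets. In a motion-planning setting the local step would instead be a per-vehicle trajectory optimization, which is where the decentralization pays off.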

Paper Structure

This paper contains 32 sections, 21 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: The dispatching of the CAVs for the passenger requests from the non-fixed taxi stops. All the requests are gradually assigned to the CAVs for personalized transportation services.
  • Figure 2: Overall Architecture of the Proposed CoDriveVLM. This framework encompasses multiple modules, from multimodal input processing to multifunctional output generation of the VLM. The input module consolidates information from traffic infrastructure, passengers, and CAVs for integration by the subsequent multimodal integrator. Within this framework, the VLM performs two primary functions: determining dispatch orders for CAVs in response to passenger requests, and assessing collision risks between CAV pairs to inform subgraph evolution for cooperative motion planning. Additionally, memory and reflection modules facilitate few-shot learning capabilities.
  • Figure 3: Annotation illustration of the BEV image. This image aggregates information from the road layout, the lane center lines, and the states of the traffic participants. The arrows denote the driving directions of the CAVs, and the number above each object is its unique ID, which facilitates language-based reasoning and inference.
  • Figure 4: Demonstration of the textual dialogue of the VLM agent. The system message summarizes the common knowledge of the environment and the task commands for the AMoD services. The human message supplements the corresponding BEV image with the information the VLM agent needs to interpret it. The AI message is the textual response of the VLM agent, which is then stored in the memory module as a resource for the agent's inference in a new dialogue episode.
  • Figure 5: Illustration of the retrieval process for CoDriveVLM. A BEV image and a human message for the similarity query are embedded with the pre-trained CLIP encoders, and pair-wise similarity is then computed against the memory vectors. The Top-$K$ most similar memory items are retrieved to support in-context learning.
  • ...and 7 more figures
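
The Top-$K$ retrieval step described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are taken as already computed (the CLIP encoding itself is not shown), cosine similarity stands in for the caption's unspecified "pair-wise similarity", and the names `cosine_similarity`, `retrieve_top_k`, `memory_vectors`, and `query_vector` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float],
                   memory: List[Sequence[float]],
                   k: int) -> List[Tuple[int, float]]:
    """Return (memory_index, similarity) for the k most similar memory vectors."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(memory)]
    # Sort by descending similarity; break ties by the lower memory index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(idx, score) for score, idx in scored[:k]]


# Toy stand-ins for precomputed embeddings (assumed values, not real CLIP outputs).
memory_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.1, 0.0]

top_items = retrieve_top_k(query_vector, memory_vectors, k=2)
print(top_items)
```

The retrieved indices would then select the stored dialogue examples that are placed in the prompt as few-shot context.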