LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking

Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, Yong Liu

TL;DR

LeapVAD addresses the limitations of data-driven autonomous driving with a cognitive perception pipeline and a dual-process decision system modeled on System-I and System-II thinking. A Vision-Language Model (VLM) based scene understanding module produces a Scene Token t ∈ ℝ^{B×256} that encodes driving-relevant object attributes, while a Scene Encoder and a MoCo-style memory dictionary enable efficient retrieval for few-shot adaptation via ACT/ACC spaces. The Analytic Process (slow, LLM-powered reasoning) applies world knowledge and traffic rules. It is complemented by a fast Heuristic Process that is distilled from it and guided by few-shot retrieval from a growing memory bank (size up to thousands of tokens). Closed-loop experiments in CARLA Town05 and DriveArena show LeapVAD achieving strong driving scores with far less data than competitive baselines, along with robust cross-domain transfer, which highlights the practicality and adaptability of a knowledge-driven approach with continuous learning.
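
The retrieval step described above can be illustrated with a minimal sketch. It assumes that stored experiences are ranked by cosine similarity between scene tokens; the similarity measure, the `Experience` fields, the action labels, and the function names are hypothetical choices for illustration, not the authors' implementation. Tokens are shortened to 4 dimensions here instead of 256.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    """One memory-bank entry (hypothetical layout, for illustration only)."""
    token: List[float]   # scene token; 256-d in the paper, 4-d in this toy example
    description: str     # textual scene description
    decision: str        # decision recorded for that scene (labels are made up)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: List[float], bank: List[Experience], k: int = 3) -> List[Experience]:
    """Return the k stored experiences whose tokens are most similar to the query."""
    ranked = sorted(bank, key=lambda e: cosine_similarity(query, e.token), reverse=True)
    return ranked[:k]


# Toy memory bank with three entries.
bank = [
    Experience([1.0, 0.0, 0.0, 0.0], "pedestrian crossing ahead", "DECELERATE"),
    Experience([0.0, 1.0, 0.0, 0.0], "clear road", "ACCELERATE"),
    Experience([0.9, 0.1, 0.0, 0.0], "cyclist merging from the right", "DECELERATE"),
]
query_token = [1.0, 0.05, 0.0, 0.0]
few_shot = retrieve_top_k(query_token, bank, k=2)
```

In this sketch the retrieved entries would then be passed to the Heuristic Process as few-shot examples. A linear scan is used for clarity; a bank with thousands of entries might call for a vectorized or indexed search instead.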

Abstract

While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of vision-language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes, including appearance, motion patterns, and associated risks, LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module that mimics how humans learn to drive. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/.

Paper Structure (42 sections, 5 equations, 15 figures, 5 tables, 1 algorithm)

Figures (15)

  • Figure 1: The architecture of LeapVAD consists of two primary modules: scene understanding and dual-process decision-making. The scene understanding module analyzes multi-view or multi-frame images, identifies critical objects, and generates a scene token that serves as a characteristic representation of the current scene. The dual-process decision-making module then uses this scene description, guided by traffic rules, to reason and make decisions, which are converted into control signals that navigate the ego car in the simulator. Specifically, the Analytic Process accumulates an initial memory bank that is used to train the Heuristic Process, and it updates the bank, especially when the Heuristic Process encounters accidents. The Heuristic Process leverages scene tokens to efficiently retrieve the most relevant historical scenarios from this memory bank, enabling rapid and informed driving decisions.
  • Figure 2: We create a dataset for VLM instruction learning derived from DriveLM [sima2023drivelm], Rank2Tell [sachdeva2024rank2tell], and CARLA [dosovitskiy2017carla]. The dataset contains two types of annotations: multi-view and multi-frame. The multi-view annotations include a summary and an elaboration, while the multi-frame annotations consist solely of a summary. Compared to the multi-view annotations, the multi-frame annotations provide additional information such as exact velocity and motion trends.
  • Figure 3: The training pipeline of our Scene Encoder: (a) details the input data; (b) illustrates how we form both ACT and ACC for the input images and update the model using contrastive loss in these two spaces; (c) presents the architecture of the Scene Encoder.
  • Figure 4: The fine-tuning process. Panel (a) illustrates the fine-tuning of the VLM on 4.1K instruction-following data points for scene understanding. Panel (b) shows how the samples collected in the memory bank are used to fine-tune Qwen-1.5, which serves as the Heuristic Process model.
  • Figure 5: Precision-Recall curves on the nuScenes [caesar2020nuscenes] dataset.
  • ...and 10 more figures
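
The closed-loop interaction described in the Figure 1 caption (a fast Heuristic Process that decides from the memory bank, and an Analytic Process that reflects after an accident and adds to the bank) can be sketched as a toy loop. Everything below is an assumption made for illustration: the function names, the string-matching "policies", and the action labels are hypothetical stand-ins for the LLM-based components and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Experience = Tuple[str, str]  # (scene description, decision); hypothetical layout


def run_episode(
    scenes: List[str],
    heuristic: Callable[[str, List[Experience]], str],
    analytic_reflect: Callable[[str, str], str],
    causes_accident: Callable[[str, str], bool],
    memory_bank: List[Experience],
) -> List[str]:
    """Drive through the scenes with the fast policy; after an accident,
    let the slow process reflect and store a corrected experience."""
    decisions = []
    for scene in scenes:
        decision = heuristic(scene, memory_bank)          # fast, System-I style
        decisions.append(decision)
        if causes_accident(scene, decision):
            corrected = analytic_reflect(scene, decision)  # slow, System-II style
            memory_bank.append((scene, corrected))         # the bank grows
    return decisions


# Toy stand-ins for the real components (all hypothetical).
def toy_heuristic(scene: str, bank: List[Experience]) -> str:
    """Reuse the stored decision for an identical scene, else accelerate."""
    for stored_scene, stored_decision in bank:
        if stored_scene == scene:
            return stored_decision
    return "ACCELERATE"


def toy_reflect(scene: str, failed_decision: str) -> str:
    """Pretend reasoning: slow down whenever a pedestrian is mentioned."""
    return "DECELERATE" if "pedestrian" in scene else failed_decision


def toy_accident(scene: str, decision: str) -> bool:
    """Accelerating toward a pedestrian counts as an accident."""
    return "pedestrian" in scene and decision == "ACCELERATE"


memory: List[Experience] = []
route = ["clear road", "pedestrian ahead"]
first_run = run_episode(route, toy_heuristic, toy_reflect, toy_accident, memory)
second_run = run_episode(route, toy_heuristic, toy_reflect, toy_accident, memory)
```

On the first pass the toy heuristic fails at the pedestrian scene and the reflection step stores a corrected experience; on the second pass the same scene is handled from memory. The real system retrieves similar rather than identical scenes by scene token, so the exact-match lookup here is only a simplification.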