Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction

Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, Hao Zhao

TL;DR

Chameleon addresses lane topology extraction for mapless autonomous driving by integrating fast, VLM-driven program synthesis with a slow, dense-prompting VLM for corner cases. The method defines lane-to-lane and lane-to-element adjacencies and uses a fast-slow architecture to balance efficiency and accuracy, aided by a chain-of-thought reasoning process and a suite of VQA tasks. It introduces API, expert-rule, few-shot, and VQA prompts to tailor code generation and reasoning, and demonstrates competitive performance on OpenLane-V2 with favorable latency compared to dense prompting. The work provides a practical, few-shot approach that leverages visual inputs in symbolic reasoning, and releases data, code, and models for benchmarking in autonomous driving topology understanding.

Abstract

Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at https://github.com/XR-Lee/neural-symbolic
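The fast-slow alternation described in the abstract can be pictured as a simple router. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Lane` type, the endpoint-distance rule in `fast_connect_score`, the `accept`/`reject` thresholds, and the `slow_vlm` callback are all hypothetical stand-ins. The fast path scores lane-to-lane connectivity with a cheap geometric rule of the kind a synthesized program might contain, and only pairs whose score falls in an ambiguous band are handed to the slow path.

```python
from dataclasses import dataclass
from math import hypot
from typing import Callable, List, Tuple

Point = Tuple[float, float]


@dataclass
class Lane:
    """Hypothetical detected lane centerline: ordered 2D points in the ego frame."""
    lane_id: str
    points: List[Point]


def fast_connect_score(a: Lane, b: Lane, scale: float = 2.0) -> float:
    """Illustrative rule: score is 1.0 when a's end touches b's start, decaying with the gap."""
    (ax, ay), (bx, by) = a.points[-1], b.points[0]
    gap = hypot(bx - ax, by - ay)
    return 1.0 / (1.0 + gap / scale)


def extract_topology(
    lanes: List[Lane],
    slow_vlm: Callable[[Lane, Lane], bool],
    accept: float = 0.8,   # assumed threshold: at or above this, trust the fast rule
    reject: float = 0.3,   # assumed threshold: at or below this, trust the fast rule
) -> Tuple[List[Tuple[str, str]], int]:
    """Return directed lane-to-lane edges and the number of slow-path calls."""
    edges: List[Tuple[str, str]] = []
    slow_calls = 0
    for a in lanes:
        for b in lanes:
            if a.lane_id == b.lane_id:
                continue
            score = fast_connect_score(a, b)
            if score >= accept:
                connected = True
            elif score <= reject:
                connected = False
            else:
                # Ambiguous pair: defer to the slow system (a VLM in the paper,
                # an arbitrary callback in this sketch).
                connected = slow_vlm(a, b)
                slow_calls += 1
            if connected:
                edges.append((a.lane_id, b.lane_id))
    return edges, slow_calls


# Usage with a stub in place of the slow VLM.
lanes = [
    Lane("A", [(0.0, 0.0), (10.0, 0.0)]),
    Lane("B", [(10.0, 0.0), (20.0, 0.0)]),
    Lane("C", [(12.0, 1.0), (20.0, 5.0)]),
]
edges, slow_calls = extract_topology(lanes, slow_vlm=lambda a, b: True)
print(edges, slow_calls)  # [('A', 'B'), ('A', 'C')] 1
```

In this toy scene only the A-to-C pair is ambiguous, so the expensive path runs once out of six candidate pairs. That is the cost argument behind the fast-slow design: the slow system is reserved for corner cases.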

Paper Structure

This paper contains 13 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: VLMs cannot directly address complex 3D scene understanding tasks such as lane topology extraction. (a) One possible approach is dense visual prompting, as in Shtedritski et al. (2023), which is accurate but inefficient. (b) Another approach is neuro-symbolic reasoning, as in NS3D (Hsu et al., 2023), which does not effectively leverage visual inputs for program synthesis. (c) Our proposed Chameleon method employs a fast-slow design, where one VLM synthesizes programs and another handles corner cases.
  • Figure 2: Overview of Chameleon. Given multi-view images as input, the vision models first detect traffic lanes and traffic elements. The proposed fast system uses a large vision-language model (VLM) that takes predefined visual-textual few-shot samples and text prompts as inputs and generates executable code to process the vision models' predictions. The proposed slow system consists of a VQA API set and a VLM with chain-of-thought reasoning; the vision prompts and text prompts within the VQA API set are the inputs to the VLM. The final topology reasoning results combine the code execution results and the VLM outputs.
  • Figure 3: Illustration of the Chameleon architecture. Given multi-view images and a text prompt as input, Chameleon performs lane topology extraction. Each API or dense visual prompting VQA task is represented as a node. Based on the input, the chain-of-thought (CoT) VLM adaptively selects which nodes to execute to infer the topology results.
  • Figure 4: Qualitative results of TopoMLP and Chameleon (ours) on the OpenLane-V2 (Wang et al., 2024) validation set. (a) The vehicle has just passed the intersection. (b) There is a left-turn traffic light ahead. (c) The lane is marked on the ground with a straight-ahead sign. (d) The vehicle is in a one-way right-turn lane. The selected scenes are all corner cases and have undergone further reasoning through dense visual prompting.
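
The node-based view in Figure 3, where each API or VQA task is a node and a chain-of-thought VLM chooses which nodes to run, can be illustrated with a small dispatcher. This is only a sketch under assumptions: the node names, the `scene` dictionary, and `rule_selector` (a hand-written stand-in for the CoT VLM) are hypothetical and are not taken from the released code. Real nodes would query a VLM with visual prompts rather than read precomputed flags.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A node maps a scene description to a short textual finding.
Node = Callable[[dict], str]
Trace = List[Tuple[str, str]]


def build_registry() -> Dict[str, Node]:
    """Hypothetical VQA task set; each entry stands in for one VLM query."""
    return {
        "is_intersection": lambda scene: "yes" if scene.get("intersection") else "no",
        "left_turn_signal": lambda scene: "yes" if scene.get("left_signal") else "no",
        "lane_arrow": lambda scene: scene.get("arrow", "none"),
    }


def rule_selector(trace: Trace) -> Optional[str]:
    """Stand-in for the CoT VLM: pick the next node from the findings so far."""
    if not trace:
        return "is_intersection"
    last_name, last_answer = trace[-1]
    if last_name == "is_intersection":
        # At an intersection, check the signal; otherwise read the lane marking.
        return "left_turn_signal" if last_answer == "yes" else "lane_arrow"
    return None  # enough evidence gathered; stop


def run_chain(
    scene: dict,
    registry: Dict[str, Node],
    select_next: Callable[[Trace], Optional[str]],
    max_steps: int = 5,
) -> Trace:
    """Execute only the nodes the selector asks for, recording each finding."""
    trace: Trace = []
    for _ in range(max_steps):
        name = select_next(trace)
        if name is None:
            break
        if name not in registry:
            raise KeyError(f"unknown node: {name}")
        trace.append((name, registry[name](scene)))
    return trace


# Usage: two scenes take different paths through the node set.
registry = build_registry()
print(run_chain({"intersection": True, "left_signal": False}, registry, rule_selector))
print(run_chain({"arrow": "straight"}, registry, rule_selector))
```

Because the selector sees the accumulated trace, each scene triggers only the queries relevant to it. That adaptive selection is what the caption attributes to the CoT VLM.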