Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?

Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao

TL;DR

Spatial reasoning remains a fundamental bottleneck for current VLMs. Model capability correlates positively with model size, the number of reasoning tokens, and the number of few-shot examples, which points to directions for future research.

Abstract

Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs' capabilities in road topology understanding. Specifically, multi-view images are projected into a unified ground-plane coordinate system and fused into bird's-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy on some tasks, they still fail on some spatial questions that humans can answer (e.g., GPT-4o achieves only 67.8% on the Vector task, a two-class classification problem). Furthermore, we find that open-source VLMs, even at the 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that model capability is positively correlated with model size, the number of reasoning tokens, and the number of few-shot examples, which points to directions for future research.
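
The ground-plane projection mentioned in the abstract can be illustrated with a planar homography, a standard way to map pixels on a flat road surface to metric coordinates. The sketch below is not the authors' pipeline. It assumes a pinhole camera with known intrinsics `K` and world-to-camera extrinsics `R`, `t`, a flat ground at Z = 0, and hypothetical example values throughout.

```python
import numpy as np

def ground_to_image_homography(K, R, t):
    """Homography taking ground-plane points (X, Y, 1) at Z = 0 to homogeneous pixels."""
    # For Z = 0 the third rotation column drops out: x ~ K [r1 r2 t] (X, Y, 1)^T
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def pixels_to_ground(pixels, K, R, t):
    """Map an (N, 2) array of pixels to (N, 2) ground-plane coordinates."""
    H_inv = np.linalg.inv(ground_to_image_homography(K, R, t))
    homogeneous = np.column_stack((pixels, np.ones(len(pixels))))
    ground = homogeneous @ H_inv.T
    return ground[:, :2] / ground[:, 2:3]  # normalise the homogeneous scale

# Hypothetical camera: 1.5 m above the ground, looking along world +X.
# World axes: X forward, Y left, Z up. Camera axes: x right, y down, z forward.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
t = -R @ np.array([0.0, 0.0, 1.5])  # t = -R C for camera centre C

# Pixel (440, 510) is where the ground point 10 m ahead, 2 m to the left appears.
ground = pixels_to_ground(np.array([[440.0, 510.0]]), K, R, t)
```

Applying the same mapping to each camera with its own `K`, `R`, `t` places lane pixels from all views in one shared coordinate frame, which is the precondition for fusing them into BEV lanes.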

Paper Structure

This paper contains 6 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Benchmark framework for evaluating the lane topology awareness of VLMs. BEV features are generated from multi-view images, different models are selected for inference, and answers are produced for four subtasks. Performance across tasks and models is then compared using radar charts combined with user questions.
  • Figure 2: Radar charts of VLM performance on TopoAware-Bench, grouped by parameter scale. Tasks include Connection (Conn), LeftRight (LR), Intersection (Area), and Vector (Vec), with average performance (Avg) also shown. Subfigures (a) to (e) compare models in different parameter ranges, including open-source and closed-source VLMs.
  • Figure 3: Performance comparison of models from the same family at different parameter scales on TopoAware-Bench. Three parameter sizes each of the InternVL3 and Qwen2.5-VL series are compared, with the five evaluation dimensions shown in different colors.
  • Figure 4: Scatter plot of model size (in billions of parameters) against average accuracy (%) on TopoAware-Bench. Different model series are represented with distinct markers.
  • Figure 5: Effect of test-time scaling (TTS), measured by thinking tokens, and of few-shot examples. The left vertical axis shows the number of tokens, the right vertical axis shows accuracy, and the horizontal axis shows the three settings in order: baseline, TTS, and TTS + few-shot. The colored bars give the number of tokens for each of the four tasks, and the marked lines give the accuracy for each of the four tasks.
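
The per-task and average scores plotted in these figures can be reproduced in spirit with a small scoring routine. This is a minimal sketch under stated assumptions: the record format, the case-insensitive exact-match rule, and the use of an unweighted mean over tasks for `Avg` are all hypothetical choices, not details confirmed by the paper. The toy records are made up for illustration.

```python
from collections import defaultdict

# Task abbreviations as used in the Figure 2 caption.
TASKS = ("Conn", "LR", "Area", "Vec")

def per_task_accuracy(records):
    """Score (task, predicted, gold) tuples; returns accuracy in percent per task.

    Assumption: answers are short strings compared by case-insensitive exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted.strip().lower() == gold.strip().lower())

    scores = {task: 100.0 * correct[task] / total[task]
              for task in TASKS if total[task]}
    if scores:
        # Assumption: Avg is the unweighted mean over the tasks that have data.
        scores["Avg"] = sum(scores.values()) / len(scores)
    return scores

# Toy records, invented for illustration only.
records = [
    ("Conn", "yes", "Yes"),
    ("Conn", "no", "no"),
    ("LR", "left", "left"),
    ("LR", "left", "right"),
    ("Area", "yes", "yes"),
    ("Vec", "no", "yes"),
    ("Vec", "yes", "yes"),
]
scores = per_task_accuracy(records)
```

An unweighted mean keeps a task with many questions from dominating the average; a question-weighted mean would be an equally defensible choice if the task sizes differ.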