From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?

Krishna Kanth Nakka, Vedasri Nakka

TL;DR

Cyclists face safety-critical decisions in urban traffic, yet existing autonomous-driving VLMs, which are largely evaluated from a vehicle-centric perspective, may not transfer to cyclist-assistive perception. The authors introduce CyclingVQA, a cyclist-perspective benchmark with 2,009 QA pairs derived from 695 real-world Munich images, covering eight tasks that probe traffic-sign grounding, spatial and temporal reasoning, lane associations, and sign–action relations. A broad evaluation across generalist, spatially enhanced, driving-specialized, and frontier models reveals that generalist VLMs often outperform driving-focused ones, with instruction tuning yielding more robust results than reasoning-focused prompting; temporal and spatial tasks remain challenging. The work identifies systematic failure modes and motivates targeted improvements for cyclist-centric intelligent systems, providing a foundation for safer, more effective cyclist assistance in urban environments.

Abstract

Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.

Paper Structure

This paper contains 23 sections, 7 figures, 27 tables.

Figures (7)

  • Figure 1: Comparison between vehicle-centric driving benchmarks (corbiere2025retrieval; tian2025nuscenes; li2025fine), which predominantly focus on road-level perspectives, and our cyclist-centric viewpoint, highlighting differences in camera perspective and the presence of cycling-specific traffic signage. See the qualitative-results appendix for further examples from our dataset.
  • Figure 1: Summary of CyclingVQA tasks.
  • Figure 2: Benchmark tasks. Illustration of the eight benchmark tasks in CyclingVQA, showing example question prompts together with visual inputs augmented by lane annotations and bounding-box supervision.
  • Figure 3: Overview of our annotation pipeline.
  • Figure 4: CoT vs. Standard Prompting. Overall performance degrades under CoT prompting across the three instruct models.
  • ...and 2 more figures