RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding

Xiyan Liu, Han Wang, Yuhu Wang, Junjie Cai, Zhe Cao, Jianzhong Yang, Zhen Lu

TL;DR

RoadSceneBench fills a critical gap by providing a compact benchmark for mid-level road semantics that bridge perception and planning. It introduces six interdependent tasks and an 11,705-image dataset (2,341 clips) from 20 Chinese cities, focused on topology, connectivity, and contextual cues. The MapVLM framework combines supervised fine-tuning with Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) to enforce spatial, relational, and temporal coherence in vision-language models. Empirical results show MapVLM achieving state-of-the-art performance across all tasks, with strong improvements on challenging aspects like ego-lane indexing and lane-change feasibility, underscoring the importance of structure-aware, temporally consistent reasoning for road understanding and map construction.

Abstract

Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of RoadSceneBench. The benchmark spans Scene-, Relational-, and Semantic-level tasks with multi-frame reasoning. Furthermore, we benchmark various open- and closed-source models.
  • Figure 2: Visualization of representative annotation types in RoadSceneBench. All examples highlight the mid-level semantics connecting perception and structural reasoning.
  • Figure 3: Framework of dataset construction and MapVLM.
  • Figure 4: Visualization of representative cases from RoadSceneBench that highlight the complexities of real-world driving scenarios. Red text indicates prediction errors, underscoring both the benchmark's difficulty and the superior reasoning of our model.
  • Figure 5: Comparison of SFT and SFT+HRRP-T on a 5-frame congested urban scene. The ego vehicle stays in the same lane: the first two frames clearly show a five-lane layout, whereas the last three frames are partially occluded. SFT reacts to these ambiguous observations with frame-wise drift in lane count and ego-lane index. SFT+HRRP-T leverages temporal evidence and preserves consistent ego-lane predictions with a coherent five-lane topology.
  • ...and 1 more figure
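
The frame-wise drift described in the Figure 5 caption can be made concrete with a small sketch. This is not the paper's HRRP-T reward, whose formulation is not given in this summary. It is a hypothetical consistency measure, assuming each frame's prediction is a (lane_count, ego_lane_index) pair and that the ego vehicle stays in one lane for the whole clip. The function name and the example predictions are illustrative assumptions, not values from the paper.

```python
from collections import Counter


def temporal_consistency(preds):
    """Fraction of frames that agree with the clip's most common prediction.

    `preds` is a list of (lane_count, ego_lane_index) tuples, one per frame.
    Hypothetical metric for illustration only; not the HRRP-T reward.
    """
    if not preds:
        raise ValueError("preds must contain at least one frame")
    # most_common(1) returns [(prediction, number_of_frames_with_it)]
    _, majority_votes = Counter(preds).most_common(1)[0]
    return majority_votes / len(preds)


# Made-up predictions for a 5-frame clip in which the ego vehicle keeps its lane.
drifting = [(5, 3), (5, 3), (4, 2), (4, 3), (5, 2)]  # frame-wise drift
stable = [(5, 3)] * 5                                # coherent five-lane topology

print(temporal_consistency(drifting))  # 0.4
print(temporal_consistency(stable))    # 1.0
```

A score of 1.0 means every frame reports the same lane count and ego-lane index. Lower scores indicate the kind of per-frame disagreement the caption attributes to the SFT-only model on occluded frames.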