RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding

Xiyan Liu; Han Wang; Yuhu Wang; Junjie Cai; Zhe Cao; Jianzhong Yang; Zhen Lu

TL;DR

RoadSceneBench fills a critical gap by providing a compact benchmark for mid-level road semantics, the layer that bridges perception and planning. It introduces six interdependent tasks and an 11,705-image dataset (2,341 clips) from 20 Chinese cities, focused on topology, connectivity, and contextual cues. The MapVLM framework combines supervised fine-tuning with Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) to enforce spatial, relational, and temporal coherence in vision-language models. Empirical results show MapVLM achieving state-of-the-art performance across all tasks, with strong improvements on challenging aspects such as ego-lane indexing and lane-change feasibility, underscoring the importance of structure-aware, temporally consistent reasoning for road understanding and map construction.

Abstract

Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.
