Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset

Yibing Weng; Yu Gu; Fuji Ren

TL;DR

Proactively regulating road rage with Vision-Language Models (VLMs) is challenging. The authors define a road rage reasoning task, release a richly annotated dashcam dataset, and benchmark leading VLMs on scene understanding, event recognition, and textual reasoning. The findings show substantial gaps in visual scene comprehension and in reasoning about spatial relationships described in text, even when manual descriptions are used to decouple understanding from reasoning. The contributions include an 81-video dataset with 22,226 annotations and a three-task evaluation framework that informs fine-tuning and model development for antecedent-focused regulation in driving. This work lays the groundwork for dialog-based calming interventions and safer driving by enabling VLMs to anticipate and mitigate road rage triggers before they escalate.

Abstract

Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks like antecedent-focused road rage regulation.
