CARScenes: Semantic VLM Dataset for Safe Autonomous Driving
Yuankai He, Weisong Shi
TL;DR
CARScenes addresses the need for interpretable scene-level supervision in driving-focused vision-language modeling. It introduces a fixed 28-key knowledge base with 350+ attributes and a discrete severity scale of 1-10, applied to 5,192 frames from Argoverse 1, Cityscapes, KITTI, and nuScenes. A GPT-4o-assisted labeling pipeline with human verification produces deterministic, schema-faithful JSONL annotations, and each frame is represented as an attribute co-occurrence graph for semantic querying and data curation. The paper provides a reproducible baseline, a LoRA-tuned Qwen2-VL-2B-Instruct model, evaluated with scalar accuracy, micro-F1 for list attributes, and severity MAE/RMSE, to calibrate task difficulty. Overall, CARScenes supports training on structured fields, content-aware data triage, cross-dataset transfer under a unified ontology, and risk-aware scenario mining for future intelligent vehicles, all without relying on video or simulator data.
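As a rough illustration of the attribute co-occurrence graph idea, the sketch below turns JSONL-style records into weighted co-occurrence edges. The field names (`weather`, `vru_types`), the `key=value` node naming, and both helper functions are assumptions made for this example; they are not taken from the released schema or scripts.

```python
import json
from collections import Counter
from itertools import combinations


def frame_attributes(record):
    """Flatten one annotation record into a set of 'key=value' attribute nodes."""
    attrs = set()
    for key, value in record.items():
        # List-valued fields contribute one node per element.
        values = value if isinstance(value, list) else [value]
        for v in values:
            attrs.add(f"{key}={v}")
    return attrs


def cooccurrence_counts(records):
    """Count how often each unordered attribute pair appears in the same frame."""
    edges = Counter()
    for rec in records:
        for a, b in combinations(sorted(frame_attributes(rec)), 2):
            edges[(a, b)] += 1
    return edges


# Two hypothetical JSONL lines (illustrative keys, not the real 28-key schema).
lines = [
    '{"weather": "rain", "vru_types": ["pedestrian"]}',
    '{"weather": "rain", "vru_types": ["pedestrian", "cyclist"]}',
]
records = [json.loads(line) for line in lines]
edges = cooccurrence_counts(records)
```

Each frame contributes a small clique over its attributes; summing the cliques across frames gives edge weights that could be used to query for frames where, say, rain and pedestrians appear together.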
Abstract
CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including those for graph construction and evaluation, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
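The three metric families named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released evaluation script: the field names (`weather`, `vru_types`, `severity`) and the function names are hypothetical, and predictions and references are assumed to be aligned lists of per-frame dictionaries.

```python
import math


def scalar_accuracy(preds, golds, key):
    """Fraction of frames whose predicted scalar field exactly matches the reference."""
    hits = sum(1 for p, g in zip(preds, golds) if p.get(key) == g.get(key))
    return hits / len(golds)


def micro_f1(preds, golds, key):
    """Micro-averaged F1 for a list-valued field: pool TP/FP/FN over all frames."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        ps, gs = set(p.get(key, [])), set(g.get(key, []))
        tp += len(ps & gs)
        fp += len(ps - gs)
        fn += len(gs - ps)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def severity_errors(preds, golds, key="severity"):
    """Return (MAE, RMSE) of the predicted severity against the reference."""
    diffs = [p[key] - g[key] for p, g in zip(preds, golds)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Hypothetical reference and predicted records for two frames.
golds = [
    {"weather": "rain", "vru_types": ["pedestrian", "cyclist"], "severity": 6},
    {"weather": "clear", "vru_types": [], "severity": 2},
]
preds = [
    {"weather": "rain", "vru_types": ["pedestrian"], "severity": 4},
    {"weather": "fog", "vru_types": ["cyclist"], "severity": 2},
]

acc = scalar_accuracy(preds, golds, "weather")  # 1 of 2 frames correct
f1 = micro_f1(preds, golds, "vru_types")        # TP=1, FP=1, FN=1
mae, rmse = severity_errors(preds, golds)       # errors of -2 and 0
```

Pooling counts before computing F1 (micro-averaging) means frames with long attribute lists weigh more than frames with short or empty ones, which is one reason to report it separately from scalar accuracy.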
