RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

Chirag Parikh, Deepti Rawat, Rakshitha R. T., Tathagata Ghosh, Ravi Kiran Sarvadevabhatla

TL;DR

RoadSocial tackles the lack of global diversity in road event understanding by introducing a large-scale, social-media-driven VideoQA dataset. It employs a scalable semi-automatic annotation pipeline that fuses video and text LLMs to generate a rich set of QA pairs across 12 tasks, including challenging adversarial and incompatible questions to probe hallucination robustness. The dataset spans 14M frames from 13.2K videos with 260K QA pairs and 674 tags, enabling thorough evaluation of 18 Video LLMs and showing that fine-tuning general-purpose models benefits road-event understanding. This resource advances cross-viewpoint, cross-geography road understanding and provides a realistic benchmark for robustness, bias awareness, and practical deployment in intelligent transportation systems.

Abstract

We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.

Paper Structure

This paper contains 31 sections, 72 figures, 4 tables.

Figures (72)

  • Figure 1: Diverse Video Attributes in the RoadSocial Dataset: The total count of unique tags for each attribute is shown in circled boxes, alongside word clouds highlighting these values. For each attribute, we display examples with 2-3 keyframes from videos. The figure captures the diversity of road events, environmental conditions, geographical locations, viewpoints, interactions between road entities, and traffic violations. The varied scenarios under each attribute showcase the rich complexity of our dataset.
  • Figure 2: RoadSocial Annotation Pipeline: The steps of the annotation pipeline are numbered 1 to 8. Raw Tweet Data consists of the video and the Twitter conversation. Step 1 splits the video into 3-second segments (in purple shaded boxes). Step 2 feeds the video segments to a Video LLM and prompts it to generate corresponding captions numbered from 1 to N. In Step 3, an LLM aggregates and summarizes these captions to generate a summary of the entire video. Step 4 filters the raw tweet textual data and extracts the captions, replies, hashtags, and user handles of tagged legal authorities (highlighted in blue). In Step 5, this filtered conversation data and the visual summary of the entire video are fed to an LLM, which is prompted to generate generic and specific QA pairs. All important aspects of the key road event mentioned in the raw tweet text, video segment captions, the entire video summary, and the generated QA pairs are highlighted in purple. The generated QA pairs are refined and categorized into pre-defined tasks in Step 6. In Step 7, expert annotators verify these QA pairs and decide whether to include or exclude them from the dataset. The human-verified QA pairs are then used as input to generate video-level tags in Step 8.
  • Figure 3: Examples of QA Pairs grouped by tasks and color-coded by task category. Questions with a gray outline are generic, while questions with gray fill shading are specific. Highlighted text indicates key information.
  • Figure 4: Diversity of the RoadSocial dataset: The number of QA pairs, social commentary (tweets), and video frames spread across different regions is shown. Overall statistics of the raw tweet data, generated QA pairs, and tags in our dataset are also shown. Total incompatible QA pairs and related numbers for non-road event videos are specified inside a light brown box at left.
  • Figure 5: QA Task Taxonomy: The QA pairs in RoadSocial are broadly grouped into 4 categories (highlighted in blue), which are further subdivided into 12 tasks (shown in green). The total QA pair count for each category is shown in a blue square box. Some of these tasks are further subdivided into granular sub-tasks (highlighted in orange) to facilitate coarse-to-fine-grained understanding of road events along different aspects.
  • ...and 67 more figures
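
Step 1 of the annotation pipeline described in the Figure 2 caption splits each video into 3-second segments before captioning. A minimal sketch of how such segment boundaries could be computed is given below. The function name `segment_bounds` and the decision to keep a shorter final segment are assumptions made for illustration; the caption does not say how the authors handle a trailing remainder.

```python
import math


def segment_bounds(duration_s, segment_s=3.0):
    """Return (start, end) times in seconds for fixed-length segments.

    Hypothetical helper: the 3-second default follows the Figure 2 caption;
    keeping a shorter last segment is an assumption, not the authors' rule.
    """
    if duration_s <= 0 or segment_s <= 0:
        return []
    # Number of windows needed to cover the whole clip.
    count = math.ceil(duration_s / segment_s)
    bounds = []
    for i in range(count):
        start = i * segment_s
        # Clamp the final window to the end of the video.
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
    return bounds


if __name__ == "__main__":
    for start, end in segment_bounds(10.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each (start, end) pair could then be handed to a video-cutting tool and the resulting clip passed to a Video LLM for captioning, as in Step 2.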