ViDAS: Vision-based Danger Assessment and Scoring
Pranav Gupta, Advith Krishnan, Naman Nanda, Ananth Eswar, Deeksha Agarwal, Pratham Gohil, Pratyush Goel
TL;DR
This work tackles two problems: quantifying danger in video, and comparing human danger judgments with LLM-based evaluations. It presents a new dataset of 100 videos rated on a 0–10 danger scale, with timestamp annotations marking moments of heightened danger, plus a multimodal evaluation framework that uses video summaries and three prompting strategies (zero-shot, fixed few-shot, and N-shot) to obtain danger ratings from LLMs. Using Mean Squared Error (MSE) as the alignment metric, the study analyzes how model size and prompting strategy affect agreement with human judgments. It finds that larger models and richer in-context information improve alignment, with N-shot prompting showing notable gains. The dataset and methodology provide a standardized benchmark for danger assessment in video, with intended applications in content moderation, situational awareness, and vision–language understanding of risk.
Abstract
We present a novel dataset aimed at advancing danger analysis and assessment. It addresses two challenges: quantifying danger in video content, and determining how closely a Large Language Model (LLM) evaluator matches human judgment on that task. The dataset comprises 100 YouTube videos featuring various events. Each video is annotated by human participants, who provide danger ratings on a scale from 0 (no danger to humans) to 10 (life-threatening), along with precise timestamps indicating moments of heightened danger. Additionally, we use LLMs to independently assess the danger levels in these videos from video summaries. We introduce Mean Squared Error (MSE) scores as the measure for multimodal meta-evaluation of the alignment between human and LLM danger assessments. Our dataset not only contributes a new resource for danger assessment in video content but also demonstrates the potential of LLMs to achieve human-like evaluations.
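As an illustration of the alignment metric described above, the following minimal sketch computes the MSE between human and LLM danger ratings. It assumes one human rating and one LLM rating per video, both on the 0–10 scale; how multiple annotators are aggregated into a single human score is not specified here, so that step is left out. The function name `danger_mse` and the example ratings are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def danger_mse(human: Sequence[float], llm: Sequence[float]) -> float:
    """Mean squared error between per-video human and LLM danger ratings.

    Assumption: each sequence holds one rating per video, in the same
    video order, on the 0 (no danger) to 10 (life-threatening) scale.
    """
    if len(human) != len(llm):
        raise ValueError("human and llm must contain one rating per video")
    if len(human) == 0:
        raise ValueError("at least one video rating is required")
    for rating in (*human, *llm):
        if not 0 <= rating <= 10:
            raise ValueError(f"rating {rating} is outside the 0-10 scale")
    # Average of squared differences; lower means closer agreement.
    return sum((h - m) ** 2 for h, m in zip(human, llm)) / len(human)


# Hypothetical ratings for four videos (illustrative values only).
human_ratings = [2.0, 7.5, 9.0, 0.0]
llm_ratings = [3.0, 6.5, 9.0, 2.0]
print(danger_mse(human_ratings, llm_ratings))  # prints 1.5
```

A lower score indicates closer agreement with human judgment. Because errors are squared, a single large disagreement (for example, rating a life-threatening clip as harmless) is penalized more heavily than several small ones.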
