ViDAS: Vision-based Danger Assessment and Scoring

Pranav Gupta, Advith Krishnan, Naman Nanda, Ananth Eswar, Deeksha Agarwal, Pratham Gohil, Pratyush Goel

TL;DR

This work tackles the problem of quantifying danger in video and comparing human judgments with LLM-based evaluations. It presents a novel 100-video dataset annotated with a 0–10 danger rating and precise timestamps marking moments of heightened danger, plus a multimodal evaluation framework that feeds video summaries to LLMs under three prompting strategies (zero-shot, fixed few-shot, and N-shot) to obtain danger ratings. Using Mean Squared Error (MSE) as the alignment metric, the study analyzes how model size and prompting strategy affect agreement with human judgments, and finds that larger models and richer in-context information improve alignment, with N-shot prompting showing notable gains. The dataset and methodology are intended as a standardized benchmark for danger assessment in video, with applications in content moderation, situational awareness, and vision–language understanding of risk.

Abstract

We present a novel dataset aimed at advancing danger analysis and assessment. It addresses the challenge of quantifying danger in video content and of measuring how human-like a Large Language Model (LLM) is as an evaluator of that danger. The dataset comprises 100 YouTube videos featuring various events. Each video is annotated by human participants, who provide danger ratings on a scale from 0 (no danger to humans) to 10 (life-threatening), along with precise timestamps indicating moments of heightened danger. Additionally, we use LLMs to independently assess the danger levels in these videos from video summaries. We introduce Mean Squared Error (MSE) scores for multimodal meta-evaluation of the alignment between human and LLM danger assessments. Our dataset not only contributes a new resource for danger assessment in video content but also demonstrates the potential of LLMs to produce human-like evaluations.
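As a rough illustration of the alignment metric described above, the following minimal sketch computes the MSE between per-video human average ratings and LLM-assigned ratings. The function name `danger_mse` and the sample ratings are hypothetical and are not taken from the paper; the paper's exact formulation may differ (for example, in how individual evaluator ratings are aggregated).

```python
from statistics import mean


def danger_mse(human_avg, llm_scores):
    """Mean squared error between human average ratings and LLM ratings.

    Both arguments are equal-length sequences of danger ratings on the
    0 (no danger) to 10 (life-threatening) scale, one entry per video.
    """
    if len(human_avg) != len(llm_scores):
        raise ValueError("both rating lists must cover the same videos")
    if not human_avg:
        raise ValueError("at least one video is required")
    return mean((h - m) ** 2 for h, m in zip(human_avg, llm_scores))


# Hypothetical ratings for three videos (illustrative values only).
human = [5.0, 0.0, 8.5]
llm = [6.0, 1.0, 7.5]
print(danger_mse(human, llm))  # 1.0
```

A lower value means the LLM's ratings sit closer to the human averages; a value of 0 would mean exact agreement on every video.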

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Given a video paired with a human-assigned score for how dangerous the scene is, we evaluate how well LLMs perceive danger via the scores they return. Above are two examples from our dataset.
  • Figure 2: Plots show (a) the distribution of 10 random evaluators' ratings and each video's average rating, and (b) the standard deviation of the seven evaluators' ratings for all 100 videos.
  • Figure 3: Example video summaries. Video (a) is given an average rating of $E_a^{(\text{avg})} = 5$ and video (b) is given an average rating of $E_b^{(\text{avg})} = 0$.
  • Figure 4: Marking the danger rating and timeframes of heightened danger using VGG Video Annotator
  • Figure 5: Human Annotation Pipeline
  • ...and 2 more figures
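
The per-video average rating and evaluator spread mentioned in the captions for Figures 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `summarize_ratings`, the video identifiers, and the ratings are hypothetical, and the use of the population standard deviation is an assumption, since the captions do not say which variant the authors use.

```python
from statistics import mean, pstdev


def summarize_ratings(ratings_by_video):
    """Return {video_id: (average rating, standard deviation)}.

    ratings_by_video maps a video identifier to the list of 0-10 danger
    ratings given by its human evaluators. The population standard
    deviation is used here as an assumption.
    """
    summary = {}
    for video_id, ratings in ratings_by_video.items():
        if not ratings:
            raise ValueError(f"no ratings for video {video_id!r}")
        if any(r < 0 or r > 10 for r in ratings):
            raise ValueError(f"rating outside 0-10 for video {video_id!r}")
        summary[video_id] = (mean(ratings), pstdev(ratings))
    return summary


# Hypothetical ratings from three evaluators for two videos.
example = {"a": [4, 5, 6], "b": [0, 0, 0]}
for video_id, (avg, std) in summarize_ratings(example).items():
    print(video_id, avg, round(std, 3))
```

A large standard deviation for a video indicates that evaluators disagreed about how dangerous it is, which is worth keeping in mind when comparing an LLM's rating against the human average.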