Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, Xiaofeng Tao

TL;DR

CUVA addresses the practical need to understand causation in video anomalies by defining three interrelated tasks (What, Why, How) and providing a large, richly annotated dataset. It introduces MMEval, a multimodal evaluation metric designed to align with human preferences, and Anomaly Guardian (A-Guardian), a prompt-based baseline that combines hard and soft prompts to extract key cues and construct a cause–effect reasoning chain. Experiments indicate that MMEval outperforms traditional metrics and that A-Guardian improves performance on description and causal tasks, giving future VLM-based anomaly understanding work a benchmark to compare against. The work is relevant to domains such as traffic surveillance and industrial monitoring, where interpretable, causally grounded video understanding is needed beyond anomaly detection alone.

Abstract

Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization, our focus is on more practicality, prompting us to raise the following crucial questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly, including 1) anomaly type, start and end times, and event descriptions, 2) natural language explanations for the cause of an anomaly, and 3) free text reflecting the effect of the abnormality. In addition, we also introduce MMEval, a novel evaluation metric designed to better align with human preferences for CUVA, facilitating the measurement of existing LLMs in comprehending the underlying cause and corresponding effect of video anomalies. Finally, we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. Our code and dataset are available at https://github.com/fesvhtr/CUVA.

Paper Structure (34 sections, 5 equations, 16 figures, 5 tables, 1 algorithm)

This paper contains 34 sections, 5 equations, 16 figures, 5 tables, 1 algorithm.

Figures (16)

  • Figure 1: Illustration of causation in a video anomaly. The clip starting at Frame D shows a traffic accident, which was caused by the event at Frame B, 7 seconds earlier. The clip at Frame F shows the effect of the anomaly. A model needs to understand such long-range relations in the video to yield correct text-based explanations.
  • Figure 2: Overview of the proposed CUVA benchmark. Our CUVA benchmark consists of manual text-based annotation, including detailed explanations of cause (Why) and effect (Why), anomaly types (What), detailed event descriptions (What), as well as importance scores that can form a curve of events (How).
  • Figure 3: Pipeline for generating an importance curve. Annotators consider the previous tasks (e.g., Logical Description, Moment Description) and the video content to create $3$ to $6$ short sentences ${T_{i}}$ describing all events in the video. These sentences are ranked by anomaly severity using ChatGPT to obtain anomaly scores $s$. In parallel, frames ${f_{t}}$ are sampled from the video and CLIP is used to measure the similarity between sentences and frames. The resulting similarity scores are multiplied by the anomaly score of each sentence to get $value_{t}$ for each frame.
  • Figure 4: Statistics of our CUVA dataset. Figure (a) shows all anomaly types in CUVA. Figure (b) and (c) show the number of videos in each anomaly type. Figure (d) shows the distribution of video length. Figure (e) shows the temporal distribution of anomalous segments.
  • Figure 5: Architecture of the proposed prompt-based method A-Guardian.
  • ...and 11 more figures
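
The last step of the Figure 3 pipeline (similarity scores multiplied by per-sentence anomaly scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `importance_curve` is hypothetical, the similarity values are assumed to have been computed already (e.g., by CLIP), and summing the weighted similarities over sentences is an assumption, since the caption does not state how the per-sentence products are combined into a single $value_{t}$.

```python
from typing import List


def importance_curve(
    similarity: List[List[float]],
    anomaly_scores: List[float],
) -> List[float]:
    """Combine frame-sentence similarities with sentence anomaly scores.

    similarity[t][i] is the (precomputed) similarity between sampled
    frame f_t and sentence T_i; anomaly_scores[i] is the score s for T_i.
    ASSUMPTION: the per-sentence products are summed to give value_t.
    """
    values = []
    for frame_sims in similarity:
        if len(frame_sims) != len(anomaly_scores):
            raise ValueError("each frame needs one similarity per sentence")
        # Weight each sentence's similarity by its anomaly score, then sum.
        values.append(sum(sim * s for sim, s in zip(frame_sims, anomaly_scores)))
    return values


# Two sampled frames, two sentences; sentence 2 is the more severe one.
curve = importance_curve([[0.5, 0.25], [0.25, 0.5]], [1.0, 2.0])
print(curve)  # [1.0, 1.25]
```

In this toy input the second frame is more similar to the more severe sentence, so it receives the higher value on the curve.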