Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly

Hang Du, Guoshun Nan, Jiawen Qian, Wangchenhui Wu, Wendi Deng, Hanqing Mu, Zhenyan Chen, Pengxuan Mao, Xiaofeng Tao, Jun Liu

TL;DR

This paper introduces ECVA, the first large-scale benchmark focused on causation understanding in video anomalies, addressing What happened, Why it happened, and How severe it was. It provides long real-world videos with rich What/Why/How annotations and introduces AnomShield, a prompt-based baseline, and AnomEval, a causation-oriented evaluation metric. Experimental results show AnomShield achieving competitive results on causal tasks and AnomEval aligning closely with human judgments, highlighting the need for causation-aware evaluation. The work enables development of causation-aware video-language models for real-world applications in surveillance, safety, and automated monitoring of complex environments.

Abstract

Recent advancements in video anomaly understanding (VAU) have opened the door to groundbreaking applications in various fields, such as traffic monitoring and industrial automation. However, current VAU benchmarks predominantly emphasize the detection and localization of anomalies. Here, we endeavor to delve deeper into the practical aspects of VAU by addressing the essential questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we introduce a comprehensive benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. Specifically, each instance of ECVA involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly, including 1) anomaly type, start and end times, and event descriptions, 2) natural language explanations for the cause of an anomaly, and 3) free text reflecting the effect of the abnormality. Building upon this foundation, we propose a novel prompt-based methodology that serves as a baseline for tackling the intricate challenges posed by ECVA. We use a "hard prompt" to guide the model to focus on the critical parts related to video anomaly segments, and a "soft prompt" to establish temporal and spatial relationships within these anomaly segments. Furthermore, we propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA. This metric leverages the unique features of the ECVA dataset to provide a more comprehensive and reliable assessment of various video large language models. We demonstrate the efficacy of our approach through rigorous experimental analysis and delineate possible avenues for further investigation into the comprehension of video anomaly causation.

Paper Structure

This paper contains 26 sections, 12 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of the causation of a video anomaly. The clip starting at Frame B shows a traffic accident, which was caused by the event indicated by Frame A, 7 seconds earlier. The clip at Frame C shows the effect of this anomaly. A model needs to understand such long-range relations in the video to yield correct text-based explanations.
  • Figure 2: Overview of the proposed ECVA benchmark. Our ECVA benchmark consists of manual text-based annotations, including detailed explanations of cause (Why) and effect (How), anomaly types (What), detailed event descriptions (What), as well as importance scores that can form a curve of events (How).
  • Figure 3: Pipeline for generating an importance curve. Annotators consider the previous tasks (e.g., Logical Description, Moment Description) and the video content to write $3$ to $6$ short sentences $\{T_{i}\}$ describing all events in the video. These sentences are ranked by anomaly severity using ChatGPT to obtain anomaly scores $s$. Simultaneously, frames $\{f_{t}\}$ are sampled from the video, and CLIP is used to measure the similarity between sentences and frames. The resulting similarity scores are multiplied by the anomaly scores for each sentence to get $value_{t}$ for each frame.
  • Figure 4: Statistics of our ECVA dataset. Figure (a) shows all anomaly types in ECVA. Figure (b) shows the number of videos in each anomaly type. Figure (c) shows the distribution of video length. Figure (d) shows the distribution of anomaly segment duration. Figure (e) shows the temporal distribution of anomalous segments.
  • Figure 5: The architecture of our AnomShield. We first conduct sparse uniform sampling for each video ①, and then apply AnomShield (optimized through a two-stage training process) to generate descriptions for the sampled frames ②. Next, we identify key frames in the video with a matching strategy ③ and conduct dense sampling around these key frames to capture the essential segments of the video ④. For these key segments, we add a spatial-temporal position embedding to each frame ⑤ and leverage a bidirectional Mamba-based method to extract their spatio-temporal relationships ⑥⑦. Then, we use an MLP to align text and image features ⑧, and finally feed the text-image features into the base LLM to get the answer.
  • ...and 4 more figures
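
The final step of the Figure 3 pipeline (similarity scores multiplied by per-sentence anomaly scores to give $value_{t}$) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `importance_curve`, the toy numbers, and the choice to sum the products over sentences are all assumptions made for illustration; the caption does not say how per-sentence products are aggregated. In practice the similarity matrix would come from CLIP and the anomaly scores from the ChatGPT ranking, and both are hard-coded here.

```python
from typing import List, Sequence


def importance_curve(
    similarity: Sequence[Sequence[float]],
    anomaly_scores: Sequence[float],
) -> List[float]:
    """Return one importance value per sampled frame.

    similarity[t][i] is the frame-sentence similarity between frame f_t and
    sentence T_i; anomaly_scores[i] is the anomaly score s of sentence T_i.
    ASSUMPTION: products are summed over sentences to get value_t.
    """
    curve = []
    for t, row in enumerate(similarity):
        if len(row) != len(anomaly_scores):
            raise ValueError(f"frame {t}: expected {len(anomaly_scores)} similarities")
        # Weight each sentence's similarity by that sentence's anomaly score.
        curve.append(sum(sim * s for sim, s in zip(row, anomaly_scores)))
    return curve


# Toy inputs (hypothetical): 3 sentences, 3 sampled frames.
anomaly_scores = [0.0, 1.0, 0.5]  # sentence 2 describes the most severe event
similarity = [
    [1.0, 0.0, 0.0],  # frame 0 matches only the normal sentence
    [0.0, 1.0, 0.0],  # frame 1 matches only the most anomalous sentence
    [0.0, 0.5, 0.5],  # frame 2 matches the two anomalous sentences equally
]

curve = importance_curve(similarity, anomaly_scores)
print(curve)  # [0.0, 1.0, 0.75]
```

Under these assumptions, frames that resemble highly ranked sentences receive high values, so plotting `curve` against frame index gives a severity curve over the video.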