A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, Yunchao Wei

TL;DR

This work addresses the lack of holistic zero-shot video anomaly analysis by proposing a training-free, unified framework that links temporal detection, spatial localization, and textual explanation through test-time, chained reasoning. It introduces two components, Intra-Task Reasoning (IntraTR) for temporal anomaly detection and Inter-Task Chaining (InterTC) to propagate priors from VAD to VAL and VAU, using frozen vision-language backbones with dynamically refined prompts. Across four benchmarks (UCF-Crime, XD-Violence, UBnormal, MSAD), the approach achieves state-of-the-art zero-shot performance on VAD, VAL, and VAU, with consistent gains from incorporating anomaly priors and selective test-time reasoning while maintaining efficiency. The framework enhances interpretability and robustness in video anomaly analysis without any additional training, though it hinges on the capabilities and biases of the underlying multimodal models and invites consideration of privacy and ethical implications in deployment.

Abstract

Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal, typically outputting only frame-wise anomaly scores without spatial or semantic context. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, our approach leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization in a fully zero-shot manner. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.

Paper Structure

This paper contains 49 sections, 10 equations, 12 figures, 13 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of the unified holistic anomaly analysis framework. Left: A preliminary step extracts the most suspicious intervals of a video, along with anomaly tag lists reflecting possible anomaly contexts. Right: Illustration of how the priors are used to refine each task. Low-confidence samples in temporal VAD are refined by a selective Intra-Task Reasoning step. Inter-Task Chaining then connects VAD to the downstream tasks, spatial VAL and textual VAU, forming a cascaded chain for unified holistic anomaly analysis.
  • Figure 2: Intra-Task Reasoning pipeline: (1) the Initial Scorer produces a score curve; (2) peak detection truncates a suspicious window and the Tag Extractor generates anomaly tags $t_V$; (3) a reasoning gate refines low-confidence predictions via the Score Updater.
  • Figure 3: Anomaly scores on a video from UCF-Crime with an "Arrest" incident.
  • Figure 4: Qualitative results of video anomaly understanding. Descriptions for a video containing an incident of "Shoplifting" from different methods, where green text highlights correct descriptions/rationale about the anomaly, and red highlights statements inconsistent with the ground truth.
  • Figure 5: $\Delta$ of score density with respect to distance to the decision boundary. Across all samples in UCF-Crime and XD-Violence, a high $m$ value results in more ambiguous predictions with $|\tilde{S}_V - \tau| \rightarrow 0$, while a small or local-variance-based $m$ effectively pushes predictions away from the decision boundary, as expected.
  • ...and 7 more figures
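
The three steps named in the Figure 2 caption (score curve, peak-centred suspicious window with tag extraction, and a gate that refines only low-confidence predictions) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. The function names, the fixed `half_width` window, the use of the window mean as the video-level score, and the `margin` parameter are all assumptions made for the example. In particular, `margin` is a hypothetical gate width and is not claimed to be the paper's $m$. The tag extractor and score updater are passed in as plain callables standing in for the frozen vision-language models.

```python
from typing import Callable, List, Sequence, Tuple


def suspicious_window(scores: Sequence[float], half_width: int) -> Tuple[int, int]:
    """Return [start, end) indices of a window centred on the highest score.

    Assumption: a single global peak and a fixed half-width; the paper's
    peak detection may differ.
    """
    peak = max(range(len(scores)), key=lambda i: scores[i])
    start = max(0, peak - half_width)
    end = min(len(scores), peak + half_width + 1)
    return start, end


def needs_reasoning(video_score: float, tau: float, margin: float) -> bool:
    """Gate: refine only predictions close to the decision threshold tau."""
    return abs(video_score - tau) < margin


def intra_task_reasoning(
    scores: Sequence[float],
    tau: float,
    margin: float,
    half_width: int,
    extract_tags: Callable[[int, int], List[str]],
    update_scores: Callable[[Sequence[float], List[str]], List[float]],
) -> Tuple[List[float], List[str]]:
    """Run window selection, tag extraction, and gated score refinement."""
    start, end = suspicious_window(scores, half_width)
    tags = extract_tags(start, end)  # hypothetical stand-in for the Tag Extractor
    window = scores[start:end]
    video_score = sum(window) / len(window)  # assumed video-level score
    if needs_reasoning(video_score, tau, margin):
        # hypothetical stand-in for the Score Updater
        return list(update_scores(scores, tags)), tags
    return list(scores), tags
```

With a stub extractor and updater, a prediction near the threshold is refined, while a confident one passes through unchanged, which is the selective behaviour the caption describes.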