A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis
Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, Yunchao Wei
TL;DR
This work addresses the lack of holistic zero-shot video anomaly analysis by proposing a training-free, unified framework that links temporal detection, spatial localization, and textual explanation through chained test-time reasoning. It introduces two components: Intra-Task Reasoning (IntraTR), which refines temporal video anomaly detection (VAD), and Inter-Task Chaining (InterTC), which propagates priors from VAD to video anomaly localization (VAL) and video anomaly understanding (VAU), using frozen vision-language backbones with dynamically refined prompts. Across four benchmarks (UCF-Crime, XD-Violence, UBnormal, MSAD), the approach achieves state-of-the-art zero-shot performance on VAD, VAL, and VAU, with consistent gains from incorporating anomaly priors and selective test-time reasoning while maintaining efficiency. The framework improves interpretability and robustness in video anomaly analysis without any additional training, though it depends on the capabilities and biases of the underlying multimodal models, and its deployment raises privacy and ethical considerations.
Abstract
Most video-anomaly research stops at frame-wise detection: methods typically output only per-frame anomaly scores, with no spatial or semantic context, and so offer little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges temporal detection, spatial localization, and textual explanation. Our approach is built on a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, it uses intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, improving interpretability and generalization. Without any additional data or gradient updates, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.
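The chaining idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `VLM`, `Analysis`, `top_window`, and `analyze`, the prompt wording, and the fixed-width window heuristic are all assumptions made for illustration, and the frozen vision-language model is represented by a plain callable. The sketch only shows how a prior from temporal detection (a suspicious segment and a short anomaly tag) might be passed into the prompts for localization and explanation. It does not model intra-task score refinement, and it builds the localization prompt without running a localizer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a frozen vision-language model:
# (prompt, frames) -> text. Illustrative only, not the paper's interface.
VLM = Callable[[str, Sequence[int]], str]


@dataclass
class Analysis:
    scores: List[float]        # per-frame anomaly scores (detection stage)
    segment: Tuple[int, int]   # inclusive indices of the most suspicious window
    anomaly_tag: str           # short textual prior derived from that window
    localization_prompt: str   # prompt a spatial localizer would receive
    explanation: str           # textual explanation conditioned on the prior


def top_window(scores: Sequence[float], width: int) -> Tuple[int, int]:
    """Return the inclusive (start, end) of the window with the highest score sum."""
    if not scores:
        raise ValueError("scores must be non-empty")
    width = max(1, min(width, len(scores)))  # clamp to the available frames
    starts = range(len(scores) - width + 1)
    best = max(starts, key=lambda s: sum(scores[s:s + width]))
    return best, best + width - 1


def analyze(frames: Sequence[int],
            score_frame: Callable[[int], float],
            vlm: VLM,
            width: int = 3) -> Analysis:
    # Stage 1: temporal detection, one score per frame.
    scores = [score_frame(f) for f in frames]
    start, end = top_window(scores, width)
    suspicious = frames[start:end + 1]

    # Derive a textual prior from the highest-scoring segment.
    tag = vlm("Name the most likely anomaly in these frames.", suspicious)

    # Stages 2 and 3: reuse the prior in the downstream prompts.
    loc_prompt = f"Locate the region showing: {tag}"
    explanation = vlm(f"Explain why these frames may show: {tag}", suspicious)
    return Analysis(scores, (start, end), tag, loc_prompt, explanation)
```

In this sketch no stage is trained; only the prompts change between stages, which loosely mirrors the training-free, prompt-driven design the abstract describes.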
