Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment
Wenti Yin, Huaxin Zhang, Xiang Wang, Yuqing Lu, Yicheng Zhang, Bingquan Gong, Jialong Zuo, Li Yu, Changxin Gao, Nong Sang
TL;DR
This paper tackles weakly supervised video anomaly detection by addressing two core challenges: incomplete normality modeling and semantic confusion among anomaly categories. It introduces DSANet, which fuses Self-Guided Normality Modeling (SG-NM) to learn video-specific normal patterns via dynamic normal prototypes and reconstruction, with Decoupled Contrastive Semantic Alignment (DCSA) to disentangle event and background content for fine-grained classification. The approach leverages CLIP-based cross-modal representations, a lightweight Text Adapter, and a unified training objective to achieve superior performance on XD-Violence and UCF-Crime for both coarse- and fine-grained WS-VAD, with extensive ablations validating each component. The results demonstrate improved temporal localization and better inter-class separability, suggesting strong practical impact for scalable, annotation-efficient anomaly detection in videos.
Abstract
Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm on top of multimodal foundation models such as CLIP to highlight anomalous instances and classify their categories. However, their objectives tend to detect only the most salient response segments while neglecting to mine the diverse normal patterns that are distinct from anomalies, and they are prone to confusing categories with similar appearance, which leads to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) that explicitly separates abnormal and normal features at both the coarse-grained and fine-grained levels, enhancing their distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video and thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
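
To make the fine-grained decomposition step more concrete, the following is a minimal sketch of how frame-level anomaly scores could split a video into event-centric and background-centric representations. It is an illustration under stated assumptions, not the authors' implementation: the soft score-weighted pooling, the function names `decouple`, `cosine`, and `nearest_class`, and the toy two-dimensional features are all hypothetical. In the paper the features come from CLIP and the alignment is trained with a contrastive objective, whereas this sketch only performs a nearest-text-embedding lookup.

```python
import math

EPS = 1e-8  # guards against division by zero (illustrative choice)


def decouple(features, scores):
    """Pool frame features into event- and background-centric vectors.

    features: list of T frame feature vectors, each a list of D floats.
    scores:   list of T frame-level anomaly scores in [0, 1].
    Assumption: soft weighting by s_t (event) and 1 - s_t (background).
    """
    if not features or len(features) != len(scores):
        raise ValueError("features and scores must be non-empty and equal in length")
    dim = len(features[0])
    event_mass = sum(scores)
    background_mass = sum(1.0 - s for s in scores)
    event = [
        sum(s * f[d] for f, s in zip(features, scores)) / (event_mass + EPS)
        for d in range(dim)
    ]
    background = [
        sum((1.0 - s) * f[d] for f, s in zip(features, scores)) / (background_mass + EPS)
        for d in range(dim)
    ]
    return event, background


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + EPS)


def nearest_class(visual, text_embeddings):
    """Return the class name whose text embedding is most similar to `visual`."""
    return max(text_embeddings, key=lambda name: cosine(visual, text_embeddings[name]))


# Toy example: the last two frames carry the anomalous event.
frame_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
anomaly_scores = [0.0, 0.0, 1.0, 1.0]
class_text = {"normal": [1.0, 0.0], "fighting": [0.0, 1.0]}  # hypothetical embeddings

event_vec, background_vec = decouple(frame_features, anomaly_scores)
predicted = nearest_class(event_vec, class_text)
```

In this toy setting the event-centric vector is dominated by the high-scoring frames and the background-centric vector by the low-scoring ones, so the two can be aligned with class text and background text separately, which is the intuition behind decoupling before alignment.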
