Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment

Wenti Yin, Huaxin Zhang, Xiang Wang, Yuqing Lu, Yicheng Zhang, Bingquan Gong, Jialong Zuo, Li Yu, Changxin Gao, Nong Sang

TL;DR

This paper tackles weakly supervised video anomaly detection (WS-VAD) by addressing two core challenges: incomplete normality modeling and semantic confusion among anomaly categories. It introduces DSANet, which combines Self-Guided Normality Modeling (SG-NM), which learns video-specific normal patterns via dynamic normal prototypes and reconstruction, with Decoupled Contrastive Semantic Alignment (DCSA), which disentangles event and background content for fine-grained classification. The approach leverages CLIP-based cross-modal representations, a lightweight Text Adapter, and a unified training objective, and it is reported to outperform prior methods on XD-Violence and UCF-Crime for both coarse- and fine-grained WS-VAD, with ablations validating each component. The results indicate improved temporal localization and better inter-class separability, which is relevant for scalable, annotation-efficient anomaly detection in videos.

Abstract

Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm on top of multimodal foundation models such as CLIP to highlight anomalous instances and classify their categories. However, their objectives tend to detect only the most salient response segments while neglecting to mine the diverse normal patterns that are separate from anomalies, and they are prone to category confusion when anomalies look alike, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) that explicitly separates abnormal and normal features at both the coarse-grained and fine-grained levels, enhancing their distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video and thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
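
The temporal decomposition described in the abstract can be illustrated with a small sketch. This excerpt does not give DSANet's exact formulation, so the code below assumes a simple soft split: frame features are averaged with weights equal to the anomaly score (event-centric) or one minus the score (background-centric). The function names `decouple_video` and `cosine`, the toy features, and the two "text embeddings" are all hypothetical and stand in for CLIP features.

```python
import math


def decouple_video(features, scores):
    """Split frame features into event- and background-centric summaries.

    features: list of T frame feature vectors (each a list of floats).
    scores:   list of T frame-level anomaly scores in [0, 1].
    Returns (event_vec, background_vec), each a score-weighted mean.
    NOTE: soft weighting is an assumption, not the paper's stated rule.
    """
    if not features or len(features) != len(scores):
        raise ValueError("features and scores must be non-empty and aligned")
    dim = len(features[0])
    eps = 1e-8  # avoids division by zero when all weights vanish

    def weighted_mean(weights):
        total = sum(weights) + eps
        return [
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ]

    event = weighted_mean(scores)                           # high-score frames dominate
    background = weighted_mean([1.0 - s for s in scores])   # low-score frames dominate
    return event, background


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


# Toy video: two "normal-looking" frames followed by two "event-looking" frames.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
anomaly_scores = [0.0, 0.0, 1.0, 1.0]
event_vec, background_vec = decouple_video(frames, anomaly_scores)

# Hypothetical text embeddings for an anomaly class and the normal class.
text_anomaly, text_normal = [0.0, 1.0], [1.0, 0.0]
event_prefers_anomaly = cosine(event_vec, text_anomaly) > cosine(event_vec, text_normal)
background_prefers_normal = (
    cosine(background_vec, text_normal) > cosine(background_vec, text_anomaly)
)
```

In a contrastive setup, the event-centric vector would be pulled toward the text embedding of its anomaly class and the background-centric vector toward the normal class; the comparison at the end only checks that the toy split points in those directions.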

Paper Structure

This paper contains 20 sections, 11 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Schematic diagram of the motivation. We identify two main issues: 1) limited understanding of normality, and 2) category confusion. We address them through normality modeling and decoupled contrastive semantic alignment.
  • Figure 2: Overview of the proposed DSANet. The model consists of three collaborative branches. The Anomaly Detection Branch produces initial frame-level binary scores using a MIL framework. The Self-Guided Normality Modeling Branch enhances the model's understanding of normal patterns by mining Dynamic Normal Patterns within the video to guide feature reconstruction, improving its ability to distinguish normal from abnormal. The Anomaly Classification Branch aligns video features with textual category embeddings for fine-grained classification, using Lightweight Text Adapters for adaptation and a Decoupled Contrastive Semantic Alignment mechanism to distinguish various anomaly types from normal categories.
  • Figure 3: Detailed structure of the proposed Decoupled Contrastive Semantic Alignment module.
  • Figure 4: t-SNE visualizations for UCF-Crime.
  • Figure 5: Comparison of Frame Distances to DNPs.
  • ...and 2 more figures
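
The captions for Figures 2 and 5 refer to Dynamic Normal Patterns (DNPs) mined from the video and to frame distances to them. The following is a minimal sketch of that idea under loud assumptions: this excerpt does not say how DSANet selects its DNPs or how the reconstruction is performed, so the code simply takes the k lowest-scoring frames as prototypes and measures each frame's Euclidean distance to its nearest prototype. The helper names `select_normal_prototypes` and `distance_to_prototypes` are hypothetical.

```python
import math


def select_normal_prototypes(features, scores, k):
    """Pick the k frames with the lowest anomaly scores as normal prototypes.

    ASSUMPTION: lowest-score selection is an illustrative rule only.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return [features[i] for i in order[:k]]


def distance_to_prototypes(features, prototypes):
    """Euclidean distance from each frame feature to its nearest prototype."""
    return [min(math.dist(f, p) for p in prototypes) for f in features]


# Toy video: three clustered "normal" frames and one outlying frame.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 4.0]]
anomaly_scores = [0.05, 0.10, 0.20, 0.90]

prototypes = select_normal_prototypes(frames, anomaly_scores, k=2)
distances = distance_to_prototypes(frames, prototypes)
```

Frames close to the prototypes get small distances and the outlying frame a large one, which is the kind of separation a comparison of frame distances to DNPs would visualize.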