Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection

Hantao Zheng, Ning Han, Yawen Zeng, Hao Chen

Abstract

Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at https://github.com/imzht/DeSC.

Paper Structure (16 sections, 8 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The sensitivity–stability trade-off. (a) Unified Optimization Dilemma. Prioritizing sensitivity yields fragmented predictions (bottom), whereas enforcing stability leads to over-smoothed responses (top). (b) DeSC resolves this via two decoupled and collaborative streams. GT denotes ground truth.
  • Figure 2: Overview of the DeSC framework. DeSC uses frozen CLIP features and two decoupled streams optimized with distinct objectives. The Temporal Sensitivity Stream applies an acausal TCN and graph transformer to capture transient anomalies under aggressive optimization, while the Semantic Consistency Stream uses a local transformer, global GCN, and multimodal Gaussian prior to maintain stable long-range semantics under robust optimization. Sliding-window test-time augmentation and collaborative fusion integrate both outputs for final anomaly scores.
  • Figure 3: Qualitative comparison on XD-Violence and UCF-Crime. Curves show anomaly scores from the Temporal Sensitivity Stream (blue), the Semantic Consistency Stream (yellow), and their fusion in DeSC (red). Color bars below display binary detections aligned with the ground truth (grey).
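
The collaborative fusion described in Figure 2 can be pictured with a small sketch. The abstract and captions do not give the fusion rule, so the code below assumes a simple convex combination of the two streams' per-snippet anomaly scores. The function name `fuse_scores`, the weight `alpha`, and the example scores are all hypothetical and are not taken from the DeSC code release.

```python
from typing import List


def fuse_scores(sensitivity: List[float],
                consistency: List[float],
                alpha: float = 0.5) -> List[float]:
    """Blend two per-snippet anomaly score sequences.

    ASSUMPTION: a convex combination with a hypothetical weight `alpha`;
    the paper's actual collaborative inference rule may differ.
    """
    if len(sensitivity) != len(consistency):
        raise ValueError("both streams must score the same number of snippets")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha weights the sensitivity stream, (1 - alpha) the consistency stream.
    return [alpha * s + (1.0 - alpha) * c
            for s, c in zip(sensitivity, consistency)]


# Illustrative scores only (not results from the paper).
sensitivity_scores = [0.1, 0.9, 0.1, 0.8, 0.9]   # spiky, reacts to abrupt change
consistency_scores = [0.2, 0.4, 0.5, 0.7, 0.8]   # smooth, temporally coherent
fused = fuse_scores(sensitivity_scores, consistency_scores, alpha=0.5)
```

With `alpha` near 1 the fused curve follows the spiky sensitivity stream, and with `alpha` near 0 it follows the smoother consistency stream. Intermediate values trade off the two behaviours that Figure 1 contrasts.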