CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

Yang Liu, Hongjin Wang, Zepu Wang, Xiaoguang Zhu, Jing Liu, Peng Sun, Rui Tang, Jianwei Du, Victor C. M. Leung, Liang Song

TL;DR

Video anomaly detection often suffers from scene bias and data shifts when trained only on normal footage. CRCL introduces a causality-inspired framework that combines Scene-debiasing Learning (SdL) with Causality-inspired Normality Learning (CiNL) to isolate normality-causing factors and enforce representation consistency. Grounded in a Structural Causal Model for scene robustness and Total Direct Effect (TDE) debiasing, CRCL uses a memory-based prototype store, shared/private feature decomposition, and correlation-based constraints to learn causal representations that remain stable under scene changes. Empirically, CRCL achieves state-of-the-art or competitive results across single- and multi-scene benchmarks (Ped2, Avenue, ShanghaiTech, NWPU Campus), demonstrates strong robustness to limited training data, and delivers real-time inference, highlighting its practical applicability for surveillance systems. The work provides a principled approach to disentangling scene bias from normality, enabling reliable open-set anomaly detection in diverse environments via causal representation consistency.

Abstract

Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods use only easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Previous studies have shown that existing unsupervised VAD models are incapable of handling label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variables in unsupervised video normality learning. Specifically, building on structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

Paper Structure

This paper contains 31 sections, 17 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Structural causal models for (a) unsupervised video anomaly detection and (b) scene-debiasing learning from the causality perspective, and (c) the schematic diagram of the plausibility analysis of causal representation consistency learning. The CVAD-SCM in (a) highlights the limitations of DeepReL models, which emphasize establishing the statistical dependencies (dashed arrow) between normal videos $\bm{X}$ and labels $\bm{Y}$ but overlook the exploration of the causal variable $\bm{Z}$. The Sd-SCM in (b) demonstrates that the deep representation $\bm{F}_{ent}$ is usually an entanglement of the normality-endogenous feature $\bm{F}_n$ and the scene bias $\bm{F}_{sce}$. The sparse mechanism shift shown in (c) posits that label-independent offsets across normal events ($\bm{n}\to \bm{n}^\prime$) exert a limited and localized influence on the learned causality (noted by ). Conversely, anomalous events ($\bm{n}\to \bm{a}$) engender an outright breakdown (noted by ) in the inherent consistency.
  • Figure 2: Pipeline overview of the CRCL. SdL utilizes the scene encoder $E_s$ and classifiers $\{\mathcal{C}_s,\mathcal{C}_m \}$ to perceive scene biases in the entangled representation $\bm{F}_{ent}$ from the motion-aware feature extractor $E_m$ and de-biases them from a consistency perspective with the TDE process. In contrast, CiNL consists of a memory network $\mathcal{M}$, prototype decomposer, and CiC to mine the causal variable and utilize a clustering algorithm to obtain task-specific representations.
  • Figure 3: Pipeline of the memory network. We introduce a filtering strategy that retains only high-similarity items to reconstruct prototype features.
  • Figure 4: Pipeline of memory-based prototype recording and private-shared features decomposition. The memory network updates memory items ($\mathcal{M}_t \to \mathcal{M}_{t+1}$) to store the prototype features of regular events through an attention addressing mechanism, while the prototype decomposer splits private features $\bm{F}_p$ and shared features $\bm{F}_s$ with two pooling operations and two MLPs $\{\theta_1, \theta_2\}$.
  • Figure 5: Pipeline of the temporal attention. $E_m$ uses channel variance-based temporal attention to actively capture important motion dynamics.
  • ...and 3 more figures
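
The captions of Figures 3 and 4 describe a memory read that addresses stored items by attention and keeps only high-similarity items when reconstructing a prototype feature. The following is a minimal sketch of that idea, not the authors' implementation. The cosine similarity, the softmax normalization over the retained items, the `top_k` parameter, and the names `cosine` and `read_memory` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def read_memory(query, memory, top_k=2):
    """Reconstruct a prototype feature from the top_k most similar memory items.

    query:  list of D floats (one feature vector)
    memory: list of M items, each a list of D floats
    top_k:  hypothetical filter size; items outside the top_k get zero weight
    """
    sims = [cosine(query, item) for item in memory]

    # Filtering step: keep only the indices of the highest-similarity items.
    keep = sorted(range(len(memory)), key=lambda i: sims[i], reverse=True)[:top_k]

    # Softmax over the retained items only (shifted by the peak for stability).
    peak = max(sims[i] for i in keep)
    exps = {i: math.exp(sims[i] - peak) for i in keep}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(memory))]

    # Attention-weighted sum of memory items gives the reconstructed prototype.
    dim = len(memory[0])
    recon = [
        sum(weights[i] * memory[i][d] for i in range(len(memory)))
        for d in range(dim)
    ]
    return recon, weights


if __name__ == "__main__":
    toy_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    prototype, attn = read_memory([1.0, 0.2, 0.0], toy_memory, top_k=2)
    print(prototype, attn)
```

In this toy run the third memory item is the least similar to the query, so it receives zero weight and contributes nothing to the reconstruction. A hard top-k cut is only one way to realize "retains only high-similarity items"; a similarity threshold would be another.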