Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection

Jash Dalvi; Ali Dabouei; Gunjan Dhanuka; Min Xu

TL;DR

The paper tackles weakly-supervised video anomaly detection under limited labeled data by introducing DAKD, a framework that first builds a Teacher model aggregating multiple backbones via a Temporal Aggregation Module (TAM) with disentangled cross-attention. This aggregated knowledge is then distilled into a lightweight Student through a bi-level pipeline that combines prediction-level distillation (using MIL-derived soft labels) and feature-level distillation (InfoNCE with a temperature parameter), yielding a practical, single-backbone model. Empirical results on UCF-Crime, ShanghaiTech, and XD-Violence demonstrate state-of-the-art performance and robust anomaly localization, with ablations validating the importance of the TAM, the distillation losses, and backbone diversity. Because only the single-backbone Student is needed at inference time, the approach balances accuracy with computational efficiency for real-world surveillance.
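The two distillation levels named above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names, the use of binary cross-entropy for the prediction-level term, the choice of the same-index Teacher feature as the InfoNCE positive, and the equal weighting of the two terms are all assumptions made for illustration; the paper's own rule for selecting positives and negatives may differ.

```python
import math

def prediction_distillation(student_scores, teacher_scores, eps=1e-7):
    """Binary cross-entropy of Student snippet scores against Teacher soft labels.

    Assumption: BCE is used here as a generic soft-label loss.
    """
    total = 0.0
    for p, q in zip(student_scores, teacher_scores):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        total -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
    return total / len(student_scores)

def _cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def feature_distillation(student_feats, teacher_feats, tau=0.1):
    """InfoNCE with temperature tau.

    Assumption: the positive for Student snippet i is Teacher snippet i,
    and every other Teacher snippet in the batch is a negative.
    """
    n = len(student_feats)
    loss = 0.0
    for i in range(n):
        logits = [_cosine(student_feats[i], teacher_feats[j]) / tau for j in range(n)]
        m = max(logits)  # shift for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n

def bilevel_distillation(student_scores, teacher_scores,
                         student_feats, teacher_feats, tau=0.1):
    """Sum of both levels; equal weighting is an assumption of this sketch."""
    return (prediction_distillation(student_scores, teacher_scores)
            + feature_distillation(student_feats, teacher_feats, tau))
```

A lower `tau` sharpens the softmax over similarities, so mismatched Student and Teacher features are penalized more heavily; this is the knob the temperature parameter controls in any InfoNCE-style loss.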

Abstract

Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos. The benchmark setup for this task is extremely challenging due to: i) the limited size of the training sets, ii) weak supervision provided in terms of video-level labels, and iii) intrinsic class imbalance induced by the scarcity of abnormal events. In this work, we show that distilling knowledge from aggregated representations of multiple backbones into a single-backbone Student model achieves state-of-the-art performance. In particular, we develop a bi-level distillation approach along with a novel disentangled cross-attention-based feature aggregation network. Our proposed approach, DAKD (Distilling Aggregated Knowledge with Disentangled Attention), demonstrates superior performance compared to existing methods across multiple benchmark datasets. Notably, we achieve significant improvements of 1.36%, 0.78%, and 7.02% on the UCF-Crime, ShanghaiTech, and XD-Violence datasets, respectively.

Paper Structure (16 sections, 8 equations, 6 figures, 5 tables)


Introduction
Related Work
Weakly Supervised VAD
Knowledge Distillation
Method
Feature Extraction
Temporal Aggregation Module
Bi-level Fine-grained Knowledge Distillation
Experiments
Datasets and Metrics
Implementation Details
Comparison with the state of the art
Ablation Study
Qualitative Results
Conclusion
...and 1 more section

Figures (6)

Figure 1: Left: A brief overview of our approach, which distills the multi-backbone Teacher model's knowledge into the Student model. In the Teacher model, representations from multiple backbones are aggregated using our proposed Temporal Aggregation Module. The single-backbone Student model is then trained with a bi-level fine-grained knowledge distillation framework. Right: Frame-level predictions for individual backbones vs. our proposed feature aggregation method on a Road Accident video from the UCF-Crime test set.
Figure 2: Schematic diagram of the proposed method. In Stage 1, the Teacher model is trained on features from several feature extractors (Section "Feature Extraction") using the Temporal Aggregation Module (Section "Temporal Aggregation Module"). In Stage 2, feature-level and prediction-level knowledge distillation is performed to distill the knowledge of the complex Teacher model into the Student model (Section "Bi-level Fine-grained Knowledge Distillation").
Figure 3: Schematic diagram of the proposed Temporal Aggregation Module. Four attention matrices are computed from the $Q^{c_t}$, $K^{c_t}$, and $V^{c_t}$ vectors, obtained from the representations of the $t^{\text{th}}$ backbone, and the relative position-based vectors $Q^r$ and $K^r$. $A_{c \to c}$ is the self content-to-content attention, $A_{c \to c'}$ is the cross content-to-content attention, $A_{c \to p}$ is the content-to-position attention, and $A_{p \to c}$ is the position-to-content attention. The output value is $H_t$, and "sftmx" denotes the softmax operation.
Figure 4: Ablation study on the UCF-Crime dataset investigating the impact of the feature backbones used in the Teacher model. Including the CLIP backbone significantly boosts the AUC score, and jointly using all three backbones (I3D, S3D, and CLIP) gives the best performance.
Figure 5: Ablation studies performed on major hyperparameters including the temperature for the contrastive loss $\tau$, the coefficient of the total distillation loss $\alpha$, the maximum relative distance parameter $k$ in the disentangled attention mechanism, and the threshold $\delta$ used to determine class labels for the contrastive loss. The ablations are performed on the UCF-Crime dataset.
...and 1 more figure
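The four attention terms named in the Figure 3 caption can be written out as a sketch. This is one plausible reading, modeled on DeBERTa-style disentangled attention (which sums three such terms and scales by $\sqrt{3d}$), and is not taken from the paper itself. The assumptions here are: the cross term uses keys from a different backbone $t' \neq t$; the four matrices are summed before the softmax; the scale is $\sqrt{4d}$, where $d$ is the feature dimension; and $\rho(i,j)$ is a hypothetical symbol for the relative distance between snippets $i$ and $j$, clipped by the maximum relative distance $k$.

```latex
\begin{aligned}
A_{c \to c}[i,j]  &= Q^{c_t}_i \,\bigl(K^{c_t}_j\bigr)^{\top} \\
A_{c \to c'}[i,j] &= Q^{c_t}_i \,\bigl(K^{c_{t'}}_j\bigr)^{\top}, \qquad t' \neq t \\
A_{c \to p}[i,j]  &= Q^{c_t}_i \,\bigl(K^{r}_{\rho(i,j)}\bigr)^{\top} \\
A_{p \to c}[i,j]  &= K^{c_t}_j \,\bigl(Q^{r}_{\rho(j,i)}\bigr)^{\top} \\
H_t &= \operatorname{softmax}\!\left(
        \frac{A_{c \to c} + A_{c \to c'} + A_{c \to p} + A_{p \to c}}{\sqrt{4d}}
       \right) V^{c_t}
\end{aligned}
```

Separating content and position terms in this way lets the model weigh what a snippet contains independently of how far apart two snippets are in time, which is the motivation usually given for disentangled attention.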
