RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Junhee Lee, ChaeBeen Bang, MyoungChul Kim, MyeongAh Cho

TL;DR

RefineVAD tackles weakly supervised video anomaly detection by addressing both how motion unfolds over time and what semantic category the anomaly belongs to. The framework combines MoTAR, which uses motion salience to dynamically recalibrate temporal features and capture long-range dependencies, with CORE, which injects soft category priors through learnable prototypes via cross-attention. This category-aware refinement guides anomaly scoring toward semantically meaningful patterns, improving localization and interpretability. Across WVAD benchmarks, RefineVAD achieves state-of-the-art results and demonstrates strong cross-dataset transfer, highlighting the value of integrating semantic structure with temporal dynamics for practical anomaly detection.

Abstract

Weakly-Supervised Video Anomaly Detection (WVAD) aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Humans perceive anomalies by jointly interpreting the temporal motion patterns and semantic structures underlying different anomaly types. Inspired by this, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on WVAD benchmarks validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
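To make the CORE idea more concrete, the following is a minimal NumPy sketch of cross-attention between segment-level features and a set of category prototypes. It is an illustration under stated assumptions, not the authors' implementation: the function name `prototype_cross_attention`, the residual update, the scaled dot-product scoring, and all shapes are hypothetical, and the prototypes are random here rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_cross_attention(segments, prototypes):
    """Hypothetical sketch of category-aware refinement.

    segments:   (T, D) array, one feature vector per video segment (queries).
    prototypes: (K, D) array, one vector per anomaly category (keys/values).
    Returns the refined features (T, D) and the soft category weights (T, K).
    """
    d = segments.shape[-1]
    # Scaled dot-product similarity between each segment and each prototype.
    scores = segments @ prototypes.T / np.sqrt(d)
    # Soft category prior: each row is a distribution over the K prototypes.
    weights = softmax(scores, axis=-1)
    # Assumed residual update: add the prototype mixture to each segment.
    refined = segments + weights @ prototypes
    return refined, weights

rng = np.random.default_rng(0)
segments = rng.normal(size=(8, 16))    # 8 segments, 16-dim features (assumed sizes)
prototypes = rng.normal(size=(4, 16))  # 4 category prototypes (assumed count)
refined, weights = prototype_cross_attention(segments, prototypes)
```

In this sketch each row of `weights` acts as a soft category assignment for one segment, which is one plausible reading of "soft anomaly category priors"; the paper's actual projection layers and training objective are not shown here.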

Paper Structure

This paper contains 16 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of WVAD paradigms. (a) A coarse-grained model predicts anomaly scores per snippet. (b) A fine-grained method introduces auxiliary video-level category classification, but these labels are not used during anomaly scoring. (c) RefineVAD (Ours): category-aware soft classification guides feature enhancement with learnable prototypes.
  • Figure 2: The overall architecture of the proposed RefineVAD. Visual and textual features are extracted from each segment and fused before being processed by MoTAR, which adaptively recalibrates temporal features based on motion salience. The resulting features are then refined by CORE, which injects soft category priors via cross-attention with learnable prototypes, enabling category-aware anomaly localization.
  • Figure 3: Overview of MoTAR. Motion variance guides adaptive channel shifting, enabling dynamic temporal feature aggregation. A Global Transformer captures long-range dependencies, improving sensitivity to diverse motions.
  • Figure 4: t-SNE visualization of logit features, where semantically similar categories form meaningful clusters.
  • Figure 5: Example frame (top) and corresponding predicted scores with ground truth (bottom). The blue curve represents the predicted scores, and the red shaded regions indicate the ground-truth anomalous intervals.
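The motion-guided channel shifting mentioned in the Figure 3 caption can be sketched as follows. This is a rough illustration under assumptions, not the MoTAR module itself: the variance estimate from frame-to-frame feature differences, the squashing function in `shift_ratio`, and the zero-padded forward/backward shift are all hypothetical choices, and the global Transformer stage is omitted.

```python
import numpy as np

def motion_variance(features):
    # features: (T, D). Variance of the frame-to-frame change magnitude,
    # used here as an assumed proxy for motion salience.
    diffs = np.diff(features, axis=0)            # (T-1, D)
    magnitudes = np.linalg.norm(diffs, axis=1)   # (T-1,)
    return float(magnitudes.var())

def shift_ratio(variance, max_ratio=0.5):
    # Hypothetical mapping from variance to the fraction of channels shifted:
    # 0 for static clips, approaching max_ratio for highly variable motion.
    return max_ratio * variance / (1.0 + variance)

def temporal_shift(features, ratio):
    # Shift half of the selected channels one step forward in time and the
    # other half one step backward, zero-padding the boundaries.
    T, D = features.shape
    n = int(D * ratio / 2)
    out = features.copy()
    if n > 0:
        out[1:, :n] = features[:-1, :n]          # forward shift
        out[0, :n] = 0.0
        out[:-1, n:2 * n] = features[1:, n:2 * n]  # backward shift
        out[-1, n:2 * n] = 0.0
    return out

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))   # 6 segments, 8 channels (assumed sizes)
ratio = shift_ratio(motion_variance(features))
shifted = temporal_shift(features, ratio)
```

The point of the sketch is only the coupling: a clip with more variable motion mixes more channels across neighboring time steps, while a static clip is left unchanged.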