
Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly Detection

Chenglizhao Chen, Xinyu Liu, Mengke Song, Luming Li, Xu Yu, Shanchen Pang

Abstract

Detecting anomalies in human-related videos is crucial for surveillance applications. Current methods are primarily appearance-based or action-based. Appearance-based methods rely on low-level visual features such as color, texture, and shape. During training they learn a large number of pixel patterns and features tied to known scenes, which makes them effective at detecting anomalies within those familiar contexts. However, when they encounter new or significantly changed scenes, i.e., unknown scenes, they often fail: existing SOTA methods do not effectively capture the relationship between actions and their surrounding scenes, which results in low generalization. In contrast, action-based methods focus on detecting anomalies in human actions but are usually less informative, because they tend to overlook the relationship between actions and their scenes, which leads to incorrect detections. For instance, the normal event of running on the beach and the abnormal event of running on the street might both be considered normal because scene information is missing. In short, current methods struggle to integrate low-level visual features with high-level action features, leading to poor anomaly detection in varied and complex scenes. To address this challenge, we propose a novel decoupling-based architecture for human-related video anomaly detection (DecoAD). DecoAD significantly improves the integration of visual and action features by decoupling and then interweaving scenes and actions, thereby enabling a more intuitive and accurate understanding of complex behaviors and scenes. DecoAD supports fully supervised, weakly supervised, and unsupervised settings.
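
The beach-versus-street example can be made concrete with a toy sketch. It is not DecoAD's model: the scene and action labels, the lookup table, and both function names are hypothetical and invented purely for illustration. It only shows why a detector that sees the action alone cannot separate the two events, while one that sees the (scene, action) pair can.

```python
# Hypothetical illustration only: the labels, the table, and the function
# names are assumptions made for this sketch, not part of DecoAD.

# (scene, action) pairs assumed to be observed as normal during training.
NORMAL_PAIRS = {
    ("beach", "running"),
    ("beach", "walking"),
    ("street", "walking"),
}

# An action-only detector keeps just the actions and discards the scene.
NORMAL_ACTIONS = {action for _, action in NORMAL_PAIRS}


def action_only_is_anomalous(action: str) -> bool:
    """Flag an action only if it was never seen as normal in any scene."""
    return action not in NORMAL_ACTIONS


def scene_aware_is_anomalous(scene: str, action: str) -> bool:
    """Flag a (scene, action) pair that was never seen as normal."""
    return (scene, action) not in NORMAL_PAIRS


if __name__ == "__main__":
    for scene in ("beach", "street"):
        print(
            scene,
            "action-only:", action_only_is_anomalous("running"),
            "scene-aware:", scene_aware_is_anomalous(scene, "running"),
        )
```

In this toy setting the action-only check treats running as normal in both scenes, whereas the scene-aware check flags running on the street. A real system would replace the lookup table with learned scene and action representations.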

Paper Structure

This paper contains 38 sections, 15 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Limitations of existing methods: appearance-based methods fail to detect anomalies because of their low generalizability (A), and action-based methods fail because they are less informative (B). "Known Scene" refers to a scene present in the training set, and "Unknown Scene" refers to a scene that is absent from the training set or has changed significantly.
  • Figure 2: Compared to appearance-based methods (A), which rely only on low-level visual features, and action-based methods (B), which ignore the relationship between scenes and human actions, our decoupling-based method (C) introduces the concept of "Scene-Action Interweaving", which fully considers the complex connections between actions and the surrounding environment in different video clips.
  • Figure 3: Pipeline of the proposed DecoAD. DecoAD consists of three steps: Step 1, Relational Knowledge Mapper (RKM); Step 2, Scene-Action Integrator (SAI) (Stage 1); and Step 3, Uncertainty Refinement (Stage 2).
  • Figure 4: Pipeline for processing an image in Scene-Action Decoupling.
  • Figure 5: Illustration of Relational Knowledge Mapper.
  • ...and 4 more figures