MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation

Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, Mohsen Imani

TL;DR

MissionGNN addresses weakly supervised video anomaly recognition by combining automatically generated mission-specific knowledge graphs with a hierarchical GNN to perform semantic reasoning over multimodal embeddings. The KG for each anomaly type is built via GPT-4 and ConceptNet, and node embeddings come from ImageBind, while a lightweight transformer handles short-term temporal dynamics, all without gradient updates to large multimodal models. A decaying-threshold training strategy enables fully frame-level supervision, with sparsity and smoothing losses to cope with data imbalance. The method achieves strong VAD/VAR performance on standard benchmarks while dramatically reducing memory usage and enabling real-time operation, showcasing practical applicability for surveillance and safety-critical tasks.
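The decaying-threshold training strategy mentioned above can be sketched in a few lines. This is an illustrative reconstruction only, not the authors' implementation: the schedule parameters (`tau0`, `decay`, `tau_min`) and function names are hypothetical. The idea is that frames of a weakly labeled anomalous video whose anomaly scores exceed a threshold that gradually decays over training are assigned frame-level pseudo labels, while sparsity and smoothness terms regularize the per-frame scores.

```python
import numpy as np

def decaying_threshold(step, tau0=0.9, decay=0.999, tau_min=0.5):
    """Hypothetical decaying-threshold schedule: starts strict, loosens over training."""
    return max(tau_min, tau0 * decay ** step)

def pseudo_labels(scores, step, **kw):
    """Frame-level pseudo labels for a weakly labeled anomalous video:
    frames scoring above the current threshold are marked anomalous."""
    tau = decaying_threshold(step, **kw)
    return (scores >= tau).astype(np.float32)

def sparsity_loss(scores):
    # Anomalies are rare, so encourage few high-scoring frames (L1 penalty).
    return float(np.mean(scores))

def smoothness_loss(scores):
    # Penalize abrupt score changes between adjacent frames.
    return float(np.mean((scores[1:] - scores[:-1]) ** 2))
```

Early in training only the most confidently anomalous frames are labeled; as the threshold decays, supervision extends to more frames, which is one plausible way to obtain fully frame-level training signals from video-level labels.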

Abstract

In the context of escalating safety concerns across various domains, the tasks of Video Anomaly Detection (VAD) and Video Anomaly Recognition (VAR) have emerged as critically important for applications such as intelligent surveillance, evidence investigation, and violence alerting. These tasks, aimed at identifying and classifying deviations from normal behavior in video data, face significant challenges due to the rarity of anomalies, which leads to extremely imbalanced data, and the impracticality of extensive frame-level annotation for supervised learning. This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN)-based model that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models and enabling fully frame-level training without fixed video segmentation. Utilizing automated, mission-specific knowledge graph generation, our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches. Experimental validation on benchmark datasets demonstrates our model's performance in VAD and VAR, highlighting its potential to redefine the landscape of anomaly detection and recognition in video surveillance systems. The code is available at: https://github.com/c0510gy/MissionGNN.

Paper Structure

This paper contains 21 sections, 11 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: The framework for mission-specific knowledge graph generation.
  • Figure 2: The overall framework for our proposed model utilizing the novel concept of hierarchical graph neural network.
  • Figure 3: Detailed process of mission-specific knowledge graph generation.
  • Figure 4: Fraction of failures in generating valid (A) nodes and (B) edges at each layer, by number of attempts.
  • Figure 5: Example of KG for detecting the "Shooting" category in the UCF-Crime dataset. Each color represents: Yellow: Sensor Node, Red: Key Concept Nodes, Blue: Sub-graph Nodes, Green: Encoding Node.
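Figure 5's color coding implies a layered graph: a sensor node (yellow) feeds key concept nodes (red), which expand into sub-graph nodes (blue) that converge on an encoding node (green). As a rough illustration only, with all identifiers hypothetical rather than taken from the authors' code, such a layered edge list for a "Shooting" mission might be assembled as:

```python
def build_mission_kg(key_concepts, subconcepts):
    """Assemble a layered mission-specific KG as a directed edge list:
    sensor -> key concepts -> sub-graph nodes -> encoding node."""
    edges = []
    for concept in key_concepts:
        edges.append(("sensor", concept))       # yellow -> red
        for sub in subconcepts.get(concept, []):
            edges.append((concept, sub))        # red -> blue
            edges.append((sub, "encoding"))     # blue -> green
    return edges

# Toy "Shooting" mission graph (concepts are made up for illustration).
shooting_kg = build_mission_kg(
    ["gun", "person"],
    {"gun": ["firearm", "weapon"], "person": ["shooter"]},
)
```

In the paper's pipeline, the concept layers would come from GPT-4 and ConceptNet, and each node would carry an ImageBind embedding; here the structure alone is shown.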