MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation
Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, Mohsen Imani
TL;DR
MissionGNN tackles weakly supervised video anomaly recognition by combining automatically generated mission-specific knowledge graphs with a hierarchical GNN that performs semantic reasoning over multimodal embeddings. The knowledge graph for each anomaly type is built via GPT-4 and ConceptNet, node embeddings come from ImageBind, and a lightweight transformer captures short-term temporal dynamics, all without gradient updates to the large multimodal models. A decaying-threshold training strategy enables fully frame-level supervision, with sparsity and smoothing losses to cope with data imbalance. The method achieves strong VAD/VAR performance on standard benchmarks while sharply reducing memory usage and enabling real-time operation, making it practical for surveillance and other safety-critical tasks.
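The training recipe above can be illustrated with a minimal sketch. This is not the paper's implementation: the schedule, coefficient names (`tau0`, `lambda_sparse`, `lambda_smooth`), and default values are all illustrative assumptions; it only shows the general shape of a decaying pseudo-label threshold plus the standard sparsity (anomalies are rare) and temporal-smoothness (adjacent frames should agree) regularizers on frame-level anomaly scores.

```python
import numpy as np

def decayed_threshold(step, tau0=0.9, decay=0.99):
    """Confidence threshold that decays over training steps (illustrative schedule)."""
    return tau0 * (decay ** step)

def pseudo_labels(scores, step):
    """Frames whose anomaly score exceeds the current threshold get a positive label."""
    return (np.asarray(scores, dtype=float) > decayed_threshold(step)).astype(int)

def regularizers(scores, lambda_sparse=8e-3, lambda_smooth=8e-3):
    """Sparsity + temporal-smoothness penalties on per-frame scores (hypothetical weights)."""
    scores = np.asarray(scores, dtype=float)
    sparsity = lambda_sparse * scores.mean()              # keep scores low overall
    smoothness = lambda_smooth * np.mean(np.diff(scores) ** 2)  # penalize abrupt jumps
    return sparsity + smoothness
```

As training progresses the threshold drops, so more frames receive pseudo-labels and supervision gradually becomes fully frame-level; the two regularizers counteract the extreme class imbalance that weak supervision otherwise amplifies.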
Abstract
In the context of escalating safety concerns across various domains, the tasks of Video Anomaly Detection (VAD) and Video Anomaly Recognition (VAR) have emerged as critically important for applications in intelligent surveillance, evidence investigation, violence alerting, etc. These tasks, aimed at identifying and classifying deviations from normal behavior in video data, face significant challenges due to the rarity of anomalies, which leads to extremely imbalanced data, and the impracticality of extensive frame-level annotation for supervised learning. This paper introduces a novel hierarchical graph neural network (GNN)-based model, MissionGNN, that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models and enabling fully frame-level training without fixed video segmentation. Utilizing automated, mission-specific knowledge graph generation, our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches. Experimental validation on benchmark datasets demonstrates our model's performance in VAD and VAR, highlighting its potential to redefine the landscape of anomaly detection and recognition in video surveillance systems. The code is available here: https://github.com/c0510gy/MissionGNN.
