HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection

Mohammad Mahdi Hemmatyar, Mahdi Jafari, Mohammad Amin Yousefi, Mohammad Reza Nemati, Mobin Azadani, Hamid Reza Rastad, Amirmohammad Akbari

TL;DR

HyCoVAD tackles complex, interaction-driven video anomaly detection by pairing a multi-task self-supervised learning (SSL) temporal analyzer with a training-free LLM-based semantic validator. The SSL component, built on an nnFormer backbone, generates anomaly proposals across multiple proxy tasks, while the LLM validates these proposals using caption-derived semantics and environment-specific rules, aided by caption refinement and rule aggregation. On ComplexVAD, HyCoVAD achieves a new state-of-the-art 72.5% frame-level AUC, outperforming baselines by 12.5%, and reduces LLM usage by about half through a coarse-to-fine filtering strategy. The work advances practical complex VAD by delivering interpretable, rule-grounded decisions with improved computational efficiency, and it provides resources such as an interaction anomaly taxonomy and an adaptive thresholding protocol for future research.

Abstract

Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies: events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD (Hybrid Complex Video Anomaly Detection), a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with an LLM validator. The SSL module is built upon an nnFormer backbone, a transformer-based model for image segmentation. Trained with multiple proxy tasks, it learns from video frames to identify those suspected of containing anomalies. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.
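
The coarse-to-fine flow described in the abstract (the SSL stage proposes suspicious frames, and only those reach the LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coarse_to_fine`, the fixed `threshold`, and the `validate` callback standing in for the LLM validator are all assumptions made for illustration, and the paper's adaptive thresholding protocol is not reproduced here.

```python
from typing import Callable, List, Sequence


def coarse_to_fine(
    ssl_scores: Sequence[float],
    validate: Callable[[int], bool],
    threshold: float,
) -> List[int]:
    """Label frames as anomalous (1) or normal (0).

    ssl_scores: hypothetical per-frame anomaly scores from the SSL stage.
    validate:   hypothetical stand-in for the LLM validator; it receives a
                frame index and returns True if the anomaly is confirmed.
    threshold:  assumed fixed cut-off (the paper uses an adaptive protocol).
    """
    labels: List[int] = []
    for idx, score in enumerate(ssl_scores):
        if score < threshold:
            # Coarse stage: low-scoring frames are accepted as normal
            # and never sent to the (expensive) validator.
            labels.append(0)
        else:
            # Fine stage: only suspicious frames are validated.
            labels.append(1 if validate(idx) else 0)
    return labels


if __name__ == "__main__":
    calls: List[int] = []

    def fake_validator(idx: int) -> bool:
        # Toy validator that records which frames it was asked about.
        calls.append(idx)
        return idx == 1

    print(coarse_to_fine([0.1, 0.9, 0.4, 0.8], fake_validator, threshold=0.5))
    print("frames sent to validator:", calls)
```

In this toy run only two of the four frames are passed to the validator, which is the mechanism by which a filtering stage can reduce LLM calls.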

Paper Structure

This paper contains 35 sections, 13 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Overview of the proposed two-stage video anomaly detection framework. Stage 1 processes video frames at both the object and frame levels: objects are detected, and object-centric temporal sequences are extracted. A multi-task self-supervised deep model with a shared nnFormer encoder performs middle-frame prediction with memory-augmented autoencoding, motion irregularity classification, jigsaw-based spatial prediction, and SocialGRU trajectory prediction with interleaved attention. Task-specific weights are learned using GradNorm. Stage 2 performs semantic verification: frame captions are generated and refined using temporal and embedding-based alignment, and an LLM evaluates each frame against automatically generated environment-specific rules. Final predictions are refined with majority smoothing. In the output examples, the first frame, in which a scooter moves straight along the street, is detected as normal. In the subsequent three frames, however, the scooter suddenly turns, an illegal and dangerous maneuver. The system flags this behavior as anomalous, highlighting its ability to capture context-dependent rule violations.
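
The majority smoothing step mentioned in the Figure 1 caption can be sketched as a sliding-window vote over per-frame binary predictions. This is only one plausible reading of the caption: the centered window, its size, and the tie-breaking rule are assumptions, and the function name `majority_smooth` is hypothetical.

```python
from typing import List


def majority_smooth(labels: List[int], window: int = 5) -> List[int]:
    """Replace each binary label with the majority vote of a centered window.

    The window is truncated at the sequence boundaries. Ties (possible only
    in truncated, even-length windows) resolve to normal (0).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed: List[int] = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        neighborhood = labels[lo:hi]
        # Anomalous only if strictly more than half of the votes are 1.
        smoothed.append(1 if 2 * sum(neighborhood) > len(neighborhood) else 0)
    return smoothed


if __name__ == "__main__":
    raw = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
    print(majority_smooth(raw, window=3))
```

With a window of 3, the isolated positive at index 1 is suppressed and the isolated negative at index 7 is filled in, so a short anomalous segment is reported as one contiguous run.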