HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection

Mohammad Mahdi Hemmatyar; Mahdi Jafari; Mohammad Amin Yousefi; Mohammad Reza Nemati; Mobin Azadani; Hamid Reza Rastad; Amirmohammad Akbari

TL;DR

HyCoVAD tackles complex, interaction-driven video anomaly detection by pairing a multi-task self-supervised learning (SSL) temporal analyzer with a training-free, LLM-based semantic validator. The SSL component, built on an nnFormer backbone, generates anomaly proposals across multiple proxy tasks. The LLM then validates these proposals using caption-derived semantics and environment-specific rules, aided by caption refinement and rule aggregation. On ComplexVAD, HyCoVAD achieves a new state-of-the-art 72.5% frame-level AUC, outperforming baselines by 12.5% and reducing LLM usage by about half through a coarse-to-fine filtering strategy. The work advances practical complex VAD by delivering interpretable, rule-grounded decisions with improved computational efficiency, and it provides resources such as an interaction anomaly taxonomy and an adaptive thresholding protocol for future research.

Abstract

Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies: events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD (Hybrid Complex Video Anomaly Detection), a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with an LLM validator. The SSL module is built upon an nnFormer backbone, a transformer-based model for image segmentation. Trained with multiple proxy tasks, it learns from video frames to identify those suspected of containing an anomaly. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.
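
The two-stage flow described above (SSL proposals first, LLM validation only on the proposed frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coarse_to_fine`, the fixed `threshold`, and the toy validator are all hypothetical stand-ins. In particular, the paper uses an adaptive thresholding protocol that is not detailed here, and the validator below is a plain Python callable rather than an LLM.

```python
from typing import Callable, List, Sequence


def coarse_to_fine(
    ssl_scores: Sequence[float],
    validate: Callable[[int], bool],
    threshold: float,
) -> List[int]:
    """Return a 0/1 anomaly label per frame.

    ssl_scores: per-frame anomaly scores from the coarse (SSL) stage.
    validate:   hypothetical stand-in for the LLM validator; it is
                called only on frames proposed by the coarse stage.
    threshold:  assumed fixed cut-off (the paper's adaptive protocol
                is not reproduced here).
    """
    labels = [0] * len(ssl_scores)
    for idx, score in enumerate(ssl_scores):
        if score < threshold:
            continue  # coarse stage: treated as normal, validator skipped
        if validate(idx):  # fine stage: semantic check on the proposal
            labels[idx] = 1
    return labels


# Toy usage: record which frames actually reach the validator.
calls: List[int] = []


def toy_validator(idx: int) -> bool:
    calls.append(idx)
    return idx % 2 == 0  # arbitrary rule, for illustration only


scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.05]
labels = coarse_to_fine(scores, toy_validator, threshold=0.5)
```

In this toy run only three of the six frames reach the validator, which is the mechanism by which a coarse filter can reduce the cost of the expensive second stage.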
