Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models
He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan
TL;DR
The paper tackles the computational burden of applying large pre-trained models to video anomaly detection by asking whether dense frame-level reasoning is necessary. It introduces ReCoVAD, a dual-pathway framework that pairs a lightweight Reflex path, built on CLIP-based fusion of visual features with prototype prompts, with a Conscious path in which a 7B LVLM and an LLM describe and reason about novel events, guided by a dynamically updated knowledge prompt. The two paths form a memory-driven loop: the Reflex path filters routine frames, while the Conscious path handles novel ones and refines both the memory and the prompts, so that slow top-down reasoning improves the fast path over time. On UCF-Crime and XD-Violence under training-free evaluation, ReCoVAD achieves state-of-the-art performance while processing only 28.55% and 16.04%, respectively, of the frames used by previous methods, illustrating that sparse reasoning with large models can be both effective and efficient.
Abstract
Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. The Conscious pathway also continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, respectively, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
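The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that decision vectors are already computed (the CLIP-based fusion step is omitted), that memory is queried by cosine similarity against a fixed threshold, and that the Conscious pathway is a plain callable returning an anomaly score. The names `ReflexMemory`, `run_stream`, and `conscious_score`, as well as the threshold value, are hypothetical; the periodic LLM review and prototype refinement are left out.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ReflexMemory:
    """Hypothetical memory of (decision vector, anomaly score) pairs."""

    def __init__(self, threshold: float = 0.9) -> None:
        # Assumed novelty cutoff; the source does not state how novelty is decided.
        self.threshold = threshold
        self.entries: List[Tuple[List[float], float]] = []

    def lookup(self, vec: Sequence[float]) -> Optional[float]:
        """Return the score of the most similar entry, or None if the frame is novel."""
        best_sim, best_score = -1.0, None
        for stored, score in self.entries:
            sim = cosine(vec, stored)
            if sim > best_sim:
                best_sim, best_score = sim, score
        return best_score if best_sim >= self.threshold else None

    def add(self, vec: Sequence[float], score: float) -> None:
        self.entries.append((list(vec), score))


def run_stream(
    vectors: Sequence[Sequence[float]],
    conscious_score: Callable[[Sequence[float]], float],
    memory: ReflexMemory,
) -> Tuple[List[float], int]:
    """Score every frame, calling the slow path only for frames the memory cannot answer."""
    scores: List[float] = []
    slow_calls = 0
    for vec in vectors:
        cached = memory.lookup(vec)
        if cached is None:
            # Novel frame: defer to the slow path and store its result.
            score = conscious_score(vec)
            memory.add(vec, score)
            slow_calls += 1
        else:
            # Routine frame: reuse the score of a similar past frame.
            score = cached
        scores.append(score)
    return scores, slow_calls


if __name__ == "__main__":
    # Toy 2-D "decision vectors" and a stub standing in for the vision-language model.
    frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.01]]
    stub = lambda v: 0.9 if v[1] > v[0] else 0.1
    print(run_stream(frames, stub, ReflexMemory(threshold=0.95)))
```

In this toy run the slow path is invoked for two of the five frames, and the remaining three reuse stored scores. That is the sense in which reasoning is sparse: the cost of the large model scales with the number of novel frames rather than with the length of the video.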
