Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models

He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan

TL;DR

The paper tackles the computational cost of applying large pre-trained models to video anomaly detection by asking whether dense frame-level reasoning is necessary. It introduces ReCoVAD, a dual-pathway framework that pairs a lightweight Reflex pathway, which fuses CLIP visual features with prototype prompts, with a Conscious pathway, which uses a 7B LVLM and an LLM to describe and reason about novel events under the guidance of a dynamically updated knowledge prompt. The two pathways form a memory-driven loop: the Reflex pathway filters routine frames, while the Conscious pathway handles novel ones and refines both the memory and the prompts, providing top-down improvement. In training-free evaluation on UCF-Crime and XD-Violence, ReCoVAD achieves state-of-the-art performance while processing only 28.55% and 16.04% of the frames used by previous methods, indicating that sparse reasoning with large models can be both effective and efficient.

Abstract

Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. The Conscious pathway continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, respectively, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
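
As a rough illustration of how a Reflex-style module might turn a frame into a decision vector, the sketch below compares one frame embedding against a set of prototype-prompt embeddings and normalizes the similarities with a softmax. The abstract states only that visual features are fused with prototype prompts, so the cosine-plus-softmax fusion, the `temperature` value, and the function names are assumptions made for this example rather than the paper's formulation. The embeddings are toy vectors standing in for CLIP outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def decision_vector(frame_feat, prototype_feats, temperature=0.07):
    """Hypothetical fusion: softmax over frame-to-prototype similarities.

    frame_feat      -- embedding of one frame (stand-in for a CLIP image feature)
    prototype_feats -- embeddings of the prototype prompts (stand-ins for CLIP text features)
    temperature     -- assumed scaling factor, not a value taken from the paper
    """
    logits = [cosine(frame_feat, p) / temperature for p in prototype_feats]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: the frame is closest to the second prototype.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame = [0.1, 0.9, 0.2]
x_i = decision_vector(frame, prototypes)
```

Each entry of `x_i` can be read as how strongly the frame matches one prototype event, which gives a compact vector that a memory of past frames could be queried with.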

Paper Structure

This paper contains 23 sections, 6 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: (a) The nervous system routes familiar signals through low-cost reflex arcs, while allocating cortical resources only to novel or complex inputs via the thalamus–cortex loop. Bidirectional feedback enables top-down modulation of reflexes and bottom-up filtering of redundant information. (b) ReCoVAD mirrors this architecture: the reflex pathway filters redundant frames by leveraging visual signals and prompt-conditioned decision vectors, while the conscious pathway applies VLM/LLM-based reasoning to novel inputs. Through a feedback loop, the conscious pathway refines both memory and prompts, progressively enhancing selectivity.
  • Figure 2: ReCoVAD consists of a Reflex and a Conscious pathway. The Reflex pathway employs a lightweight CLIP model to construct decision vectors $X_I$ by matching frame $I$’s visual features with textual event prototypes in the knowledge prompt $\mathcal{P}$. It then queries a dynamic memory $\mathcal{M}$ of representative records to decide if deep analysis is needed. If not, the anomaly score is retrieved directly via the reflex function based on the frame's neighbors in $\mathcal{M}$. Otherwise, the Conscious pathway processes the frame using a VLM anomaly analyzer to generate event descriptions and anomaly scores under the guidance of $\mathcal{P}$, updating $\mathcal{M}$ with new records and contributing to the description set $\mathcal{B}$. An LLM-based reasoner then periodically samples from $\mathcal{B}$ to revise prior decisions and refine $\mathcal{P}$, which in turn guides both pathways for top-down refinement.
  • Figure 3: Visualization of the predictions made by the reflex pathway and the conscious pathway.
  • Figure 4: Ablation on the parameter $K$ and the radius of the decision hypersphere in $F_{reflex}$.
  • Figure 5: Ablation on the parameter $\epsilon$ in $F_{filter}$.
  • ...and 6 more figures
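
The routing described in the Figure 2 caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the class `ReflexMemory`, the function `score_frame`, and the callback `conscious_fn` are hypothetical names; a single Euclidean threshold `epsilon` stands in for both the filter $F_{filter}$ and the neighborhood radius of $F_{reflex}$; and the reflex score is assumed to be a plain average over at most `k` neighbors. The periodic LLM review of the description set $\mathcal{B}$ and the refinement of $\mathcal{P}$ are omitted.

```python
import math


class ReflexMemory:
    """Hypothetical stand-in for the dynamic memory M of (decision vector, score) records."""

    def __init__(self, epsilon, k):
        self.epsilon = epsilon  # assumed distance threshold for "familiar" frames
        self.k = k              # assumed maximum number of neighbors to average
        self.records = []

    def neighbors(self, x):
        """Return up to k stored records within epsilon of x, nearest first."""
        ranked = sorted((math.dist(x, vec), score) for vec, score in self.records)
        return [(d, s) for d, s in ranked if d <= self.epsilon][: self.k]

    def add(self, x, score):
        self.records.append((list(x), score))


def score_frame(x, memory, conscious_fn):
    """Route one decision vector through the reflex or the conscious pathway."""
    near = memory.neighbors(x)
    if near:
        # Reflex pathway: reuse the scores of nearby records (assumed: simple mean).
        return sum(s for _, s in near) / len(near), "reflex"
    # Conscious pathway: run the expensive analyzer, then store the new record.
    score = conscious_fn(x)
    memory.add(x, score)
    return score, "conscious"


# Toy run: the analyzer is a stub standing in for the VLM anomaly analyzer.
calls = []

def fake_analyzer(x):
    calls.append(x)
    return 0.9

memory = ReflexMemory(epsilon=0.1, k=3)
first = score_frame([0.0, 1.0], memory, fake_analyzer)    # novel frame
second = score_frame([0.05, 1.0], memory, fake_analyzer)  # close to a stored record
```

In this toy run the second frame falls within `epsilon` of the first, so it is scored from memory and the stub analyzer is invoked only once. That is the kind of saving the sparse-reasoning argument relies on: the expensive model runs only on frames the memory cannot explain.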