HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs
Zhaolin Cai, Fan Li, Ziwei Zheng, Yanjun Qin
TL;DR
This paper tackles video anomaly detection (VAD) under tuning-free constraints by exploiting intermediate hidden states in pre-trained Multimodal LLMs (MLLMs). It uncovers an Information-rich Phenomenon where intermediate layers (peaking near layer $l\approx20$) offer higher anomaly sensitivity, linear separability, and information concentration, quantified via $D_{\text{KL}}(l)$, $\text{LDR}(l)$, and $H(l)$. Building on this, the authors propose HiProbe-VAD, featuring Dynamic Layer Saliency Probing to select the optimal layer $l^*$, a lightweight logistic-regression anomaly scorer, and temporal localization with explainable VAD outputs. Experiments on UCF-Crime and XD-Violence show competitive performance against tuning-free and self-supervised baselines, with strong cross-model generalization across multiple MLLMs and notable zero-shot transfer capabilities, highlighting practicality and scalability for real-world deployment.
Abstract
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, which restricts their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. We find that the intermediate hidden states of MLLMs contain information-rich representations that exhibit higher sensitivity and linear separability for anomalies than the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that identifies the optimal intermediate layer and extracts its most informative hidden states during MLLM reasoning. A lightweight anomaly scorer and a temporal localization module then detect anomalies efficiently from these extracted hidden states, after which the framework generates explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free methods and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization across different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
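To make the probing idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes per-layer hidden states have already been extracted as arrays; here they are replaced by synthetic data whose class separation is made to peak at a middle layer. The layer score is a simple Fisher-style ratio used as a stand-in for the saliency measures named above ($D_{\text{KL}}$, LDR, $H$), whose exact definitions are not given in this section. All function names and parameters are hypothetical, and temporal localization and explanation generation are not covered.

```python
import numpy as np


def make_synthetic_hidden_states(n_layers=6, n_samples=200, dim=16, seed=0):
    """Synthetic stand-in for per-layer MLLM hidden states (assumption).

    Returns states with shape (n_layers, n_samples, dim) and binary labels
    (1 = anomalous). Class separation is largest at the middle layer.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    centre = n_layers // 2
    states = []
    for layer in range(n_layers):
        gap = 2.0 * np.exp(-0.5 * (layer - centre) ** 2)
        x = rng.normal(size=(n_samples, dim))
        x[:, 0] += gap * labels  # shift anomalous samples along one axis
        states.append(x)
    return np.stack(states), labels


def fisher_ratio(x, y):
    """Between-class over within-class variance (stand-in saliency score)."""
    mu0, mu1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    var0, var1 = x[y == 0].var(axis=0), x[y == 1].var(axis=0)
    return float(np.sum((mu1 - mu0) ** 2) / (np.sum(var0 + var1) + 1e-8))


def select_layer(states, labels):
    """Return the index of the layer with the highest score, plus all scores."""
    scores = [fisher_ratio(states[i], labels) for i in range(states.shape[0])]
    return int(np.argmax(scores)), scores


def fit_logistic(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y  # gradient of the log-loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def anomaly_score(x, w, b):
    """Probability-like anomaly score in [0, 1] for each sample."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


if __name__ == "__main__":
    states, labels = make_synthetic_hidden_states()
    best_layer, layer_scores = select_layer(states, labels)
    w, b = fit_logistic(states[best_layer], labels)
    scores = anomaly_score(states[best_layer], w, b)
    accuracy = ((scores > 0.5) == labels).mean()
    print("selected layer:", best_layer)
    print("training accuracy of the linear probe:", round(float(accuracy), 3))
```

In a real pipeline the synthetic arrays would be replaced by hidden states read from a frozen MLLM, and the probe would be fitted on a small labeled subset. The point of the sketch is only that a linear scorer on a well-chosen intermediate layer can separate the two classes without updating any model weights.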
