Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection

Zhaolin Cai, Fan Li, Huiyu Duan, Lijun He, Guangtao Zhai

TL;DR

This work proposes a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations, establishing it as a powerful new direction for video anomaly detection.

Abstract

Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, so some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited because they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, which makes subtle or ambiguous anomalies difficult to handle. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first uses gradient-free representational separability analysis (RSA) to identify the top attention heads, termed latent anomaly experts (LAEs), that are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate that our method achieves state-of-the-art performance among tuning-free approaches while requiring only 1% of training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon publication.
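
The head-selection step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact RSA criterion, so the Fisher-style ratio below (between-class distance over within-class spread) is an assumed stand-in, and the names `head_separability`, `select_top_k_heads`, and the toy data are hypothetical. What the sketch shares with the description is only the general idea: score each attention head without gradients on a small calibration set, then keep the top K heads.

```python
import random
from statistics import fmean, pvariance


def head_separability(normal, anomalous, eps=1e-8):
    """Fisher-style separability for one head (an assumed stand-in for RSA).

    normal, anomalous: lists of feature vectors (lists of floats) collected
    from this head on normal and anomalous calibration clips.
    """
    dim = len(normal[0])
    between = 0.0
    within = 0.0
    for d in range(dim):
        n_vals = [x[d] for x in normal]
        a_vals = [x[d] for x in anomalous]
        between += (fmean(n_vals) - fmean(a_vals)) ** 2
        within += pvariance(n_vals) + pvariance(a_vals)
    return between / (within + eps)


def select_top_k_heads(features_by_head, k):
    """features_by_head maps head_id -> (normal_vectors, anomalous_vectors)."""
    scores = {h: head_separability(n, a) for h, (n, a) in features_by_head.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy calibration set: six hypothetical heads, two of which shift on anomalies.
random.seed(0)


def sample(shift, n=50, dim=4):
    return [[random.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


shifts = {0: 0.0, 1: 0.0, 2: 3.0, 3: 0.0, 4: 0.0, 5: 1.5}
features_by_head = {h: (sample(0.0), sample(s)) for h, s in shifts.items()}
top_heads = select_top_k_heads(features_by_head, k=2)
print(top_heads)
```

In this toy setup the two heads whose anomalous features are shifted rank highest. No gradients or weight updates are involved, which is the property the abstract emphasizes.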

Paper Structure (92 sections, 18 equations, 15 figures, 18 tables, 3 algorithms)


Figures (15)

  • Figure 1: Comparison of traditional full-training methods, existing tuning-free methods, and our proposed SteerVAD. Compared to previous methods, ours avoids costly training and mitigates the biases inherited from pre-trained foundation models while requiring minimal data.
  • Figure 2: 3D UMAP visualization of representation manifolds of normal (blue) and anomalous (red) events from InternVL, illustrating their geometric structure. Each manifold is rendered from two perspectives using (a) cubic interpolation and (b) triangulation.
  • Figure 3: Framework overview of SteerVAD. We first apply the Representational Separability Analysis to find the top-$K$ Latent Anomaly Experts inside the frozen MLLM. During the single pass, the global context vector $\mathbf{c}$ and LAE features $\{\mathbf{h}_i\}$ are extracted. The Hierarchical Meta-Controller ingests these signals, using a Global Scrutiny Gate and a Local Gating Module to generate manipulation signals ($s_{\text{global}}$, $\{\mathbf{g}_i\}$). These signals perform Anisotropic Manifold Scaling to rectify the LAE features. A lightweight Anomaly Scorer receives the rectified features and outputs the final anomaly curve. Detected anomalous frames can be passed to the full MLLM to produce a textual explanation.
  • Figure 4: Visualization of manifold rectification via t-SNE on aggregated features from the LAEs for the XD-Violence and UCF-Crime datasets.
  • Figure 5: Evolution of anomaly expert activations across different calibration data scales. The consistent distribution of hotspots from 1% to 100% visually confirms that the identification of anomaly experts saturates rapidly.
  • ...and 10 more figures
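
The rectification step in the Figure 3 caption can be sketched as follows. The caption names the signals ($s_{\text{global}}$, $\{\mathbf{g}_i\}$) but not the formulas, so everything concrete here is an assumption: a sigmoid global gate computed from $\mathbf{c}$, tanh local gates computed from each $\mathbf{h}_i$, and the scaling rule $\mathbf{h}_i' = \mathbf{h}_i \odot (1 + s_{\text{global}}\,\mathbf{g}_i)$. The weights `w_global` and `w_local` are hypothetical random placeholders, not learned parameters from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def meta_controller(c, lae_features, w_global, w_local):
    """Assumed controller: one scalar global gate plus per-dimension local gates."""
    s_global = sigmoid(dot(c, w_global))  # scalar in (0, 1)
    gates = [[math.tanh(dot(h, row)) for row in w_local] for h in lae_features]
    return s_global, gates


def anisotropic_scale(lae_features, gates, s_global):
    """Assumed rule h' = h * (1 + s_global * g), applied per dimension.

    Positive gate entries amplify a dimension, negative entries suppress it,
    and s_global = 0 leaves the features unchanged.
    """
    return [
        [x * (1.0 + s_global * g) for x, g in zip(h, g_i)]
        for h, g_i in zip(lae_features, gates)
    ]


# Toy inputs: K = 3 hypothetical LAE heads with 8-dim features, 16-dim context.
random.seed(0)
K, dim, ctx_dim = 3, 8, 16
c = [random.gauss(0.0, 1.0) for _ in range(ctx_dim)]
h = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(K)]
w_global = [random.gauss(0.0, 0.1) for _ in range(ctx_dim)]
w_local = [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

s_global, gates = meta_controller(c, h, w_global, w_local)
h_rect = anisotropic_scale(h, gates, s_global)
```

Because each dimension gets its own scale factor, the transformation is anisotropic: it stretches some directions of the feature space and shrinks others rather than rescaling the whole vector uniformly. The frozen model's weights are never modified; only the extracted features are.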