Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection

Ali Zia, Usman Ali, Muhammad Umer Ramzan, Hamza Abid, Abdul Rehman, Wei Xiang

Abstract

Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question-answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
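
The abstract states that caption-derived representations are projected into hyperbolic space, but does not say which model of hyperbolic geometry is used. As an illustration only, the sketch below assumes the Poincaré ball with curvature $-c$ and uses the standard exponential map at the origin and the standard geodesic distance. The function names `expmap0` and `poincare_distance`, the curvature parameter `c`, and the stabiliser `eps` are hypothetical and are not taken from the paper.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-9):
    """Map Euclidean vectors onto the Poincare ball of curvature -c.

    Assumption: exponential map at the origin; the paper may use a
    different hyperbolic model or projection.
    """
    v = np.asarray(v, dtype=float)
    # Clamp the norm so the zero vector maps to the origin without dividing by 0.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def poincare_distance(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between points that lie inside the Poincare ball."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    # Each factor tends to 0 as a point approaches the boundary of the ball.
    denom = (1.0 - c * np.sum(x ** 2, axis=-1)) * (1.0 - c * np.sum(y ** 2, axis=-1))
    arg = 1.0 + 2.0 * c * sq_diff / np.maximum(denom, eps)
    return np.arccosh(arg) / np.sqrt(c)


# Hypothetical caption embeddings (the values are illustrative, not real data).
general_scene = expmap0([0.1, 0.0])
specific_event = expmap0([1.5, 0.2])
gap = poincare_distance(general_scene, specific_event)
```

Under this model, distances grow rapidly towards the boundary of the ball, so coarse concepts can sit near the origin while fine-grained ones spread out near the edge. This is the usual argument for why hyperbolic embeddings suit hierarchical data.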

Paper Structure (16 sections, 9 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Comparison of prior VAD architectures and our modality-agnostic MM-VAD, which introduces adaptive query optimisation and hyperbolic fusion for zero-shot, explainable anomaly detection.
  • Figure 2: An overview of our MM-VAD framework. Visual captions $C_{\text{vis}}$ and audio captions $C_{\text{aud}}$ are fused in hyperbolic space, and the fused features drive unsupervised refinement of the anomaly query from $Q_t$ to $Q_{t+1}$. A LLaMA-based scorer then evaluates textual video summaries against $Q_{t+1}$, and a cross-modal refinement step produces the final anomaly scores.
  • Figure 3: Overview of MM-VAD on XD-Violence. (a) MM-VAD outperforms LAVAD with sharper anomaly localisation and richer action-aware descriptions. (b) Adaptive prompts ($t_0$--$t_4$) for frozen-LLM querying. (c) Euclidean embeddings. (d) Hyperbolic embeddings with clearer structure and better separation.
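
The cross-modal refinement step in Figure 2 is described in the abstract as covariance-aware and Mahalanobis-based, but its exact formulation is not given in this summary. The sketch below shows only the generic building block: a Mahalanobis distance computed from a shrinkage-regularised covariance estimate. The helper names `fit_gaussian` and `mahalanobis`, the `shrinkage` parameter, and the choice of a scaled-identity shrinkage target are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def fit_gaussian(features, shrinkage=0.1):
    """Estimate the mean and precision matrix of reference features.

    Assumes `features` has shape (n_samples, n_dims) with n_dims >= 2 and
    non-zero total variance. The covariance is shrunk towards a scaled
    identity so that it stays invertible when n_samples is small.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    dim = features.shape[1]
    target = (np.trace(cov) / dim) * np.eye(dim)
    cov = (1.0 - shrinkage) * cov + shrinkage * target
    return mean, np.linalg.inv(cov)


def mahalanobis(x, mean, precision):
    """Mahalanobis distance of one vector, or a batch of vectors, from `mean`."""
    diff = np.asarray(x, dtype=float) - mean
    # Quadratic form diff^T * precision * diff, evaluated per row for batches.
    return np.sqrt(np.einsum("...i,ij,...j->...", diff, precision, diff))


# Hypothetical reference features standing in for visual embeddings.
reference = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mean, precision = fit_gaussian(reference)
scores = mahalanobis(np.array([[1.0, 1.0], [3.0, 1.0]]), mean, precision)
```

Unlike a plain Euclidean distance, this measure rescales each direction by the observed spread of the reference features, so a deviation along a low-variance direction counts for more than the same deviation along a high-variance one.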