Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, Nong Sang

TL;DR

  • Holmes-VAD tackles biased and opaque video anomaly detection by coupling precise single-frame supervision with multimodal instruction tuning to enable both accurate localization and explanations.
  • A new benchmark, VAD-Instruct50k, provides single-frame anomaly annotations and instruction-grounded explanations for trimmed clips, built via a semi-automatic data engine and LLM-assisted generation.
  • The model architecture combines a frozen Video Encoder, a trainable Temporal Sampler, and a fine-tuned Multi-modal LLM (with LoRA) to produce frame-level anomaly scores and natural language explanations.
  • Empirical results show state-of-the-art performance on XD-Violence and UCF-Crime, along with favorable human-evaluated explanations, supporting the practicality of interpretable VAD in real-world settings.

Abstract

In open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events, and they lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations. First, toward an unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm. Efficient single-frame annotations are applied to the collected untrimmed videos, which are then synthesized into high-quality analyses of both abnormal and normal video clips using a robust off-the-shelf video captioner and a large language model (LLM). Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection. We train a lightweight temporal sampler to select frames with high anomaly response and fine-tune a multimodal LLM to generate explanatory content. Extensive experimental results validate the generality and interpretability of the proposed Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. To support the community, our benchmark and model will be publicly available at https://holmesvad.github.io.

Paper Structure (20 sections, 8 equations, 13 figures, 2 tables, 1 algorithm)

Figures (13)

  • Figure 1: Towards unbiased and explainable VAD. In contrast to prevailing VAD approaches (a), which primarily concentrate on identifying anomalies, our method (b) facilitates not only unbiased predictions of anomaly scores (i.e., fewer false alarms on easily confused or unseen normal events) but also explanations of detected anomalies, by constructing a large-scale VAD dataset with single-frame annotations for untrimmed videos and explainable instruction data for trimmed videos.
  • Figure 2: Data engine for the proposed VAD-Instruct50k. We collect numerous abnormal/normal videos from existing datasets, followed by a series of annotation enhancements including temporal single-frame annotation, event clip generation, and event clip captioning. We then construct the instruction data by prompting a powerful LLM with the enhanced annotations. Throughout the pipeline, manual work and large foundation models coordinate with each other to ensure efficiency and quality in construction.
  • Figure 3: Overview of Holmes-VAD. Holmes-VAD takes an untrimmed video and a user prompt as inputs, and outputs anomaly scores and explanations for detected anomalies. The Temporal Sampler takes the class tokens of frames as input and estimates anomaly scores; the dense visual tokens are resampled according to their anomaly scores before entering the projector.
  • Figure 4: Human evaluation on models under different training settings.
  • Figure 5: Ablation study of backbone and supervision in Temporal Sampler.
  • ...and 8 more figures
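
The Temporal Sampler described in the Figure 3 caption can be illustrated with a rough sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the linear-plus-sigmoid scorer, the function names `score_frames` and `select_frames`, and the `threshold` and `max_frames` parameters are hypothetical, and a simple thresholded top-k selection stands in for whatever resampling rule the paper actually uses.

```python
import math
from typing import List, Sequence


def score_frames(cls_tokens: Sequence[Sequence[float]],
                 weights: Sequence[float],
                 bias: float = 0.0) -> List[float]:
    """Map each frame's class token to an anomaly score in (0, 1).

    Assumption: a single linear layer followed by a sigmoid stands in
    for the trainable Temporal Sampler.
    """
    scores = []
    for token in cls_tokens:
        logit = sum(w * x for w, x in zip(weights, token)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def select_frames(scores: Sequence[float],
                  threshold: float = 0.5,
                  max_frames: int = 8) -> List[int]:
    """Return indices of the highest-scoring frames, in temporal order.

    Assumption: thresholded top-k selection is a simplified stand-in
    for score-based resampling of dense visual tokens.
    """
    candidates = [i for i, s in enumerate(scores) if s > threshold]
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:max_frames]
    return sorted(top)


# Toy class tokens for four frames (two dimensions each).
cls_tokens = [[0.0, 0.0], [2.0, 0.0], [-2.0, 0.0], [4.0, 0.0]]
scores = score_frames(cls_tokens, weights=[1.0, 0.0])
kept = select_frames(scores, threshold=0.6, max_frames=2)
print(kept)  # [1, 3]
```

In this toy run, only the dense visual tokens of frames 1 and 3 would be passed on to the projector and the multimodal LLM; the frozen video encoder that would produce the class tokens is omitted.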