Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, Nong Sang

TL;DR

The paper tackles the challenge of understanding video anomalies that unfold across different temporal scales by proposing HIVAU-70k, a large-scale hierarchical anomaly dataset with clip-, event-, and video-level annotations. It introduces Holmes-VAU, a multimodal framework that uses an Anomaly-focused Temporal Sampler (ATS) to concentrate computation on anomaly-rich segments and a fine-tuned LLM to generate structured explanations, trained on hierarchical instruction data. Empirical results show substantial improvements over state-of-the-art methods in anomaly detection and reasoning on open-world surveillance datasets, and ablations validate the benefits of hierarchical data, ATS, and LoRA-based fine-tuning. The work advances open-world video anomaly understanding (VAU) by enabling efficient, interpretable, multi-granularity anomaly understanding, with practical implications for surveillance and safety applications.

Abstract

How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.
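The density-aware sampling idea in the abstract can be illustrated with a minimal sketch. The abstract states only that ATS combines an anomaly scorer with a density-aware sampler; the inverse-cumulative-sum scheme below, the function name `density_aware_sample`, and the floor parameter `tau` are assumptions made for illustration, not the released implementation. Per-frame scores are assumed to come from some upstream anomaly scorer.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, num_samples, tau=0.1):
    """Pick frame indices so that high-score regions are sampled more densely.

    scores      -- per-frame anomaly scores (assumed to be produced upstream)
    num_samples -- number of frame indices to return
    tau         -- hypothetical positive floor so normal regions still get frames
    """
    if not scores or num_samples <= 0:
        return []
    # Sampling density per frame: anomaly score plus a small floor.
    density = [max(s, 0.0) + tau for s in scores]
    # Running total of density along the time axis.
    cumulative = list(accumulate(density))
    total = cumulative[-1]
    indices = []
    for k in range(num_samples):
        # Targets are spaced uniformly along the cumulative axis...
        target = (k + 0.5) / num_samples * total
        # ...and mapped back to the first frame whose running total reaches them.
        indices.append(bisect_left(cumulative, target))
    return indices


# Ten frames with a short anomalous burst at frames 4 and 5.
frame_scores = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
print(density_aware_sample(frame_scores, num_samples=4))  # [3, 4, 5, 6]
```

With uniform scores the same routine spreads its picks across the whole video, so the floor `tau` controls how much of the frame budget is reserved for normal context. When `num_samples` is large relative to the video length, indices can repeat; a caller may want to deduplicate them.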

Paper Structure

This paper contains 27 sections, 6 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Motivation. Left: Existing datasets lack the hierarchical structure to capture transient and sustained anomalies across varying temporal scales. Our HIVAU-70k dataset addresses this by providing multi-granularity annotations—clip, event, and video levels—that enable detailed anomaly analysis in complex real-world scenarios. Right: Inspired by Sherlock Holmes’s knack for zeroing in on critical details, our Holmes-VAU method integrates an Anomaly-focused Temporal Sampler with a multi-modal LLM, directing model attention to anomaly-rich segments, which enables models to decode complex, long-term video anomalies efficiently.
  • Figure 2: Data Engine. We present a structured workflow for generating hierarchical annotations across video, event, and clip levels. Clips are first captioned, then processed through a large language model (LLM) with prompts for event summarization. The outputs include clip captions, event summaries, and video summaries, followed by manual checking and refinement. This multi-step approach enriches the dataset with detailed judgments, descriptions, and analyses of anomalies, enabling robust contextual understanding at varying granularities.
  • Figure 3: HIVAU-70k dataset. (a) Duration distributions for clips, events, and full videos, showing dominance of short clips. (b) Hierarchical data organization from clip-level to video-level, enabling perception-to-reasoning insights. (c) Word count variations across annotation levels, with more detailed descriptions at the video level. (d) Sample annotations capturing captioning, judgment, description, and anomaly analysis, highlighting nuanced understanding of anomaly events in complex scenes.
  • Figure 4: Holmes-VAU: a multi-modal-LLM-based video anomaly detection framework with adaptive anomaly focus.
  • Figure 5: Qualitative comparison of anomaly understanding explanations. Compared with state-of-the-art general MLLMs, i.e., InternVL2 [chen2024far] and QwenVL2 [Qwen2VL], our proposed Holmes-VAU demonstrates more accurate anomaly judgment, along with more detailed and comprehensive anomaly-related descriptions and analysis. Correct and wrong explanations are highlighted in green and red, respectively.
  • ...and 11 more figures
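
The clip-to-event-to-video flow described in the Figure 2 caption can be sketched as a small recursive structure. This is an illustration only: the class names, fields, and the `summarize` callable are hypothetical, the LLM is replaced by a placeholder that joins strings, and the manual checking and refinement step mentioned in the caption is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    start: float   # seconds
    end: float     # seconds
    caption: str   # clip-level caption


@dataclass
class Event:
    clips: List[Clip]
    summary: str = ""   # event-level summary, filled in by annotate()


@dataclass
class Video:
    events: List[Event]
    summary: str = ""   # video-level summary, filled in by annotate()


def annotate(video: Video, summarize: Callable[[List[str]], str]) -> Video:
    """Build summaries bottom-up: clip captions -> event summary -> video summary."""
    for event in video.events:
        event.summary = summarize([clip.caption for clip in event.clips])
    video.summary = summarize([event.summary for event in video.events])
    return video


def join_summarizer(texts: List[str]) -> str:
    """Placeholder for an LLM call (assumption): concatenates its inputs."""
    return " | ".join(texts)


video = Video(events=[
    Event(clips=[Clip(0.0, 2.0, "a car approaches"), Clip(2.0, 4.0, "the car hits a pole")]),
    Event(clips=[Clip(10.0, 12.0, "people gather around")]),
])
annotate(video, join_summarizer)
print(video.summary)
```

Passing the summarizer as a callable keeps the hierarchy independent of any particular model, so the same structure works with a real LLM client or with a stub in tests.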