Hawk: Learning to Understand Open-World Video Anomalies

Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, Ying-Cong Chen

TL;DR

Hawk addresses the challenge of open-world video anomaly understanding by coupling an explicit motion modality with a large visual-language model. It introduces a dual-branch architecture (appearance and motion), a mutual-information-based attention mechanism, and motion-language supervision to tightly align motion cues with linguistic descriptions. The data engine augments seven anomaly datasets with dense anomaly descriptions and extensive open-world QA pairs, and the model is pretrained on WebVid before fine-tuning on anomaly data. Empirically, Hawk achieves state-of-the-art results on both anomaly description generation and open-world question answering, demonstrating strong generalization to diverse scenarios and practical interactive capabilities.

Abstract

Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs. However, current VAD systems are often limited by their superficial semantic understanding of scenes and minimal user interaction. Additionally, the prevalent data scarcity in existing datasets restricts their applicability in open-world scenarios. In this paper, we introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely. Recognizing the difference in motion information between abnormal and normal videos, Hawk explicitly integrates motion modality to enhance anomaly identification. To reinforce motion attention, we construct an auxiliary consistency loss within the motion and video space, guiding the video branch to focus on the motion modality. Moreover, to improve the interpretation of motion-to-language, we establish a clear supervisory relationship between motion and its linguistic representation. Furthermore, we have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions. The final results demonstrate that Hawk achieves SOTA performance, surpassing existing baselines in both video description generation and question-answering. Our codes/dataset/demo will be released at https://github.com/jqtangust/hawk.

Paper Structure (47 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Different frameworks in video anomaly detection. (A) shows traditional video anomaly detection methods, which use binary classifiers to detect anomalies. (B), following (A), introduces a multi-class classifier for integrating semantic information, allowing users to obtain different types of anomaly information. Neither (A) nor (B) can interact with users. (C) is a previous video understanding framework that can interactively provide richer semantic information for users, but cannot specifically locate video anomalies. Our framework (D) enhances the anomaly understanding capability and provides annotated labels with rich semantic information.
  • Figure 2: Generation pipeline of our dataset. In the first line, we first segment videos into clips and generate dense captions for each segment, including a comprehensive description of the video content. Then, we use GPT-4 to guide the generation of corresponding anomalous video descriptions based on these descriptions, which are then manually checked to reduce mistakes. In the second line, to generate user-centered QA pairs, we first use GPT-4 to generate open-ended questions based on the proposed two principles. Then, the questions and video descriptions are jointly input into GPT-4 to provide possible answers.
  • Figure 3: Overview of Hawk. During training (Black and Gray path), we aim to optimize for video-language matching loss, along with Video-Motion Consistency and Motion-Language Matching. During inference (only Gray path), we generate language descriptions using video, motion, and text.
  • Figure 4: Visualization of Hawk's loss. 1 is the original video-to-language loss. 2 is the cosine similarity loss for motion modality adaptation. 3 is the motion-to-language loss.
  • Figure 5: Training & Testing.
  • ...and 3 more figures
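
The three loss terms named in the Figure 4 caption (video-to-language loss, cosine similarity loss for motion adaptation, and motion-to-language loss) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, the use of token-level cross-entropy for the two language losses, and the weights `w_consistency` and `w_motion_lang` are all assumptions made for illustration.

```python
import numpy as np


def cosine_consistency_loss(video_emb, motion_emb, eps=1e-8):
    """Mean of (1 - cosine similarity) between paired rows.

    video_emb, motion_emb: arrays of shape (batch, dim). Shapes are assumed.
    """
    v = video_emb / (np.linalg.norm(video_emb, axis=-1, keepdims=True) + eps)
    m = motion_emb / (np.linalg.norm(motion_emb, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(v * m, axis=-1)))


def token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of target token ids.

    logits: (num_tokens, vocab_size); targets: (num_tokens,) integer ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))


def total_loss(video_logits, video_targets,
               motion_logits, motion_targets,
               video_emb, motion_emb,
               w_consistency=0.1, w_motion_lang=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    l_video_lang = token_cross_entropy(video_logits, video_targets)
    l_consistency = cosine_consistency_loss(video_emb, motion_emb)
    l_motion_lang = token_cross_entropy(motion_logits, motion_targets)
    return l_video_lang + w_consistency * l_consistency + w_motion_lang * l_motion_lang
```

In this sketch the consistency term reaches zero when the video and motion embeddings point in the same direction, which matches the stated goal of guiding the video branch toward the motion modality; how the actual model computes and weights these terms should be taken from the paper and released code.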