Aligning Effective Tokens with Video Anomaly in Large Language Models

Yingxian Chen, Jiahui Liu, Ruidi Fan, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W. T. Fok, Xiaojuan Qi, Yik-Chung Wu

TL;DR

The paper addresses detecting and describing anomalies in videos, where abnormal events are sparse in both space and time. It introduces VA-GPT, an MLLM that aligns effective visual tokens with an LLM through two modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). The authors also describe a two-stage training strategy, construct an instruction-following dataset for video anomalies, and introduce a cross-domain evaluation benchmark based on XD-Violence. The reported results show state-of-the-art performance in anomaly localization and cross-domain generalization, suggesting that selective token mechanisms can improve video anomaly understanding in multimodal large language models, with potential applications in security, surveillance, and safety.

Abstract

Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The baseline video understanding MLLM feeds every visual token (yellow squares) forward equally during fine-tuning and inference (top row). In contrast, our method focuses on the effective area (unobstructed area in medium video frames) in each frame and selects the Spatial Effective Tokens (orange squares) for the LLM (see the SETS section); filtered tokens are shown as gray squares. At the same time, we generate anomaly-aware Temporal Effective Tokens (green squares) (see the TETG section) based on the anomaly score (denoted as s) assigned to each frame by a pre-trained classifier, for better temporal localization of anomalies.
  • Figure 2: Detailed illustration of our proposed model. When a video is fed into the model, patch embeddings and class embeddings (c.ebd) are extracted from all frames. 1) Based on the difference in patch embeddings between the current frame and its neighbouring frame, we obtain a filter mask that removes unimportant visual tokens (dashed squares) from the current frame's visual tokens. The remaining Spatial Effective Tokens are compressed by a projector with pooling into an aligned content token for each frame, and they also attend to the user's text input to produce an aligned context token for each frame. 2) Based on the class embeddings (c.ebd) of all frames, we use a pre-trained Anomaly-aware Classifier to localize the time period of abnormal events, thereby generating Temporal Effective Tokens to feed forward into the LLM. All of the resulting aligned tokens are fed into the LLM for reasoning and inference.
  • Figure 3: Qualitative results shown as question-answer diagrams. The red circles in the figures correspond to the bold text in the answers. From short videos of only a dozen seconds, to medium videos longer than one minute, to long videos of about half an hour, our model reasons well and understands the content.
  • Figure 4: Visualization of the initial videos and our masked results. These two cases illustrate road accident scenarios: one occurring in a bustling street and the other in an empty suburb. Our SETS effectively filters redundant and irrelevant regions (with black patch-level masks).
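
The two mechanisms described in the Figure 2 caption can be illustrated with a rough sketch. This is not the authors' implementation: the cosine distance, the fixed thresholds, the choice of the previous frame as the neighbour, and all function names (`spatial_filter_mask`, `select_tokens`, `localize_anomaly`) are assumptions made only for illustration. The per-frame anomaly scores are taken as given, standing in for the output of the pre-trained classifier.

```python
"""Illustrative sketch of inter-frame token filtering and temporal localization.

Assumptions (not from the paper): cosine distance as the patch difference,
fixed thresholds, and plain Python lists in place of real embeddings.
"""
from math import sqrt
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cos(a, b); a zero vector is treated as maximally distant."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def spatial_filter_mask(
    current: Sequence[Vector], neighbour: Sequence[Vector], threshold: float = 0.1
) -> List[bool]:
    """Mark patches whose embedding differs from the neighbouring frame
    by more than `threshold` (True = keep the token)."""
    return [cosine_distance(c, n) > threshold for c, n in zip(current, neighbour)]


def select_tokens(tokens: Sequence, mask: Sequence[bool]) -> list:
    """Keep only the tokens whose mask entry is True."""
    return [t for t, keep in zip(tokens, mask) if keep]


def localize_anomaly(
    scores: Sequence[float], threshold: float = 0.5
) -> Optional[Tuple[int, int]]:
    """Return (first, last) frame indices of the longest run of frames whose
    anomaly score is >= threshold, or None if no frame qualifies."""
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    # A sentinel score below any threshold closes a run that reaches the end.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


if __name__ == "__main__":
    current = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    neighbour = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    mask = spatial_filter_mask(current, neighbour)
    print(mask)                                      # [False, True, False]
    print(select_tokens(["p0", "p1", "p2"], mask))   # ['p1']
    print(localize_anomaly([0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6]))  # (2, 4)
```

In this toy setup only the patch that changed between frames survives the mask, and the localized interval is the longest contiguous run of high-scoring frames. How the paper turns such an interval into Temporal Effective Tokens is not specified in this summary, so that step is left out.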