
Video Anomaly Detection and Explanation via Large Language Models

Hui Lv, Qianru Sun

TL;DR

The paper tackles threshold-dependent and opaque video anomaly detection by integrating video-based large language models (VLLMs) into VAD to yield textual explanations. It introduces VAD-LLaMA, combining a VADor with a Long-Term Context module and a three-phase training pipeline to handle long-range context and scarce domain data. Empirical results on UCF-Crime and TAD show state-of-the-art AUC performance and demonstrable ability to describe anomalies; ablations highlight the value of LTC and short-term history. The approach also enables explainable, interactive analysis of anomalies via textual prompts and multi-turn dialogue, signaling practical value for surveillance systems.

Abstract

Video Anomaly Detection (VAD) aims to localize abnormal events on the timeline of long-range surveillance videos. Anomaly-scoring-based methods have prevailed for years but suffer from the high complexity of thresholding and the low explainability of detection results. In this paper, we conduct pioneering research on equipping video-based large language models (VLLMs) in the framework of VAD, making the VAD model free from thresholds and able to explain the reasons for the detected anomalies. We introduce a novel network module, Long-Term Context (LTC), to mitigate the incapability of VLLMs in long-range context modeling. We design a three-phase training method to improve the efficiency of fine-tuning VLLMs by substantially minimizing the requirements for VAD data and lowering the costs of annotating instruction-tuning data. Our trained model achieves the top performance on the anomaly videos of the UCF-Crime and TAD benchmarks, with AUC improvements of +3.86% and +4.96%, respectively. More impressively, our approach can provide textual explanations for detected anomalies.
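The thresholding problem and the AUC metric mentioned in the abstract can be illustrated with a small sketch. It is not the paper's evaluation code: the scores, labels, and function names (`threshold_predictions`, `roc_auc`) are hypothetical and chosen only to show that frame-level decisions change with the threshold, while AUC summarizes ranking quality without fixing one.

```python
from typing import List

# Toy per-frame anomaly scores and ground-truth labels (1 = abnormal).
# These values are made up for illustration; they are not from the paper.
scores: List[float] = [0.1, 0.2, 0.45, 0.8, 0.9, 0.4]
labels: List[int] = [0, 0, 0, 1, 1, 1]


def threshold_predictions(scores: List[float], threshold: float) -> List[int]:
    """Turn anomaly scores into binary decisions using a fixed threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Rank-based AUC: the probability that a random abnormal frame scores
    higher than a random normal frame (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one abnormal and one normal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# The same scores give different detections under different thresholds.
print(threshold_predictions(scores, 0.5))
print(threshold_predictions(scores, 0.3))
# AUC does not depend on any single threshold.
print(roc_auc(scores, labels))
```

With the stricter threshold the last abnormal frame is missed, and with the looser one a normal frame is flagged. This is the kind of sensitivity that motivates replacing a tuned threshold with a model that states its decision in text.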

Paper Structure (12 sections, 4 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Prediction scores from a baseline VAD model, and clip descriptions produced by VLLMs, for a car accident video (shown in the middle of the figure). On the score curve, the red dashed lines denote anomaly thresholds. The bottom shows the answers from Video-LLaMA (Zhang et al., 2023) when fed two pairs of video clips and questions: {Green: a normal video clip, "Is there any anomaly in the video?"} and {Orange: an abnormal video clip, "Is there a car accident? If so, is it an anomaly?"}
  • Figure 2: The network architecture of the proposed VAD-LLaMA. It consists of a Video Anomaly Detector (VADor) with the Long-Term Context (LTC) module and a simple Anomaly Predictor (AP), a projection layer (called the Adaptor), and the pre-trained Video-LLaMA (Zhang et al., 2023), composed of a Video Encoder (VE) and a LLaMA. The training of VAD-LLaMA is decomposed into three phases, and the trainable and frozen modules vary across phases. The training phases are given in Figure 3.
  • Figure 3: The training of VAD-LLaMA consists of three phases: 1) VAD baseline training, 2) VAD co-training with LTC, and 3) instruction-tuning of the Adaptor. In the LTC module, $\mathbf{N}$ and $\mathbf{A}$ represent the long-term normal and abnormal feature lists, respectively. The red arrow denotes the generation process from anomaly scores to pseudo instructions with text templates.
  • Figure 4: An abuse example comparing VAD-LLaMA with Video-LLaMA. The red boxes in the frames mark ground-truth anomalies. The orange boxes are the questions from humans. The gray and blue boxes are the answers from Video-LLaMA and our VAD-LLaMA, respectively. Best viewed in color.
  • Figure 5: Two qualitative examples of VAD-LLaMA.
  • ...and 1 more figure
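
The Figure 3 caption mentions generating pseudo instructions from anomaly scores with text templates. The exact templates and conversion rule are not given in this summary, so the following is only a minimal sketch under stated assumptions: the answer wording, the `cutoff` of 0.5, the one-second clip length, and the helper names `anomalous_segments` and `pseudo_instruction` are all hypothetical. Only the question string is taken from the Figure 1 caption.

```python
from typing import Dict, List, Tuple

# Question text as quoted in the Figure 1 caption.
QUESTION = "Is there any anomaly in the video?"


def anomalous_segments(scores: List[float], cutoff: float = 0.5) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) clip-index ranges whose scores reach the
    cutoff. The cutoff is an illustrative assumption, not a value from the paper."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(scores):
        if s >= cutoff and start is None:
            start = i  # a new anomalous run begins
        elif s < cutoff and start is not None:
            segments.append((start, i - 1))  # the run ended at the previous clip
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))  # run extends to the last clip
    return segments


def pseudo_instruction(scores: List[float], clip_seconds: float = 1.0,
                       cutoff: float = 0.5) -> Dict[str, str]:
    """Fill a hypothetical text template from per-clip anomaly scores."""
    segs = anomalous_segments(scores, cutoff)
    if not segs:
        answer = "No, there is no anomaly in the video."
    else:
        spans = ", ".join(
            f"{a * clip_seconds:.0f}s to {(b + 1) * clip_seconds:.0f}s" for a, b in segs
        )
        answer = f"Yes, there is an anomaly from {spans}."
    return {"question": QUESTION, "answer": answer}


# Toy scores for six one-second clips (made up for illustration).
example = pseudo_instruction([0.1, 0.2, 0.9, 0.8, 0.1, 0.7])
print(example["question"])
print(example["answer"])
```

Question and answer pairs produced this way could serve as instruction-tuning data without manual annotation, which matches the abstract's stated goal of lowering annotation cost.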