Harnessing Large Language Models for Training-free Video Anomaly Detection

Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, Elisa Ricci

TL;DR

This work tackles video anomaly detection in a training-free setting by exploiting pre-trained vision-language models and large language models. The proposed LAVAD pipeline uses (i) captioning to describe frames, (ii) caption cleaning via cross-modal similarity, (iii) LLM-driven temporal aggregation to generate frame-wise anomaly scores, and (iv) video-text score refinement to align scores with visual context. It achieves competitive results on UCF-Crime and XD-Violence without data collection or model training, outperforming other training-free baselines and surpassing some unsupervised methods in AUC-ROC. The study demonstrates the potential of language models to perform temporal anomaly reasoning in vision tasks, while highlighting practical considerations such as caption reliability and prompt design for real-world deployment.

Abstract

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods tend to be domain-specific and are therefore costly to deploy in practice, since any domain change requires new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.

Paper Structure (15 sections, 5 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: We introduce the first training-free method for video anomaly detection (VAD), diverging from state-of-the-art methods that are ALL training-based with different degrees of supervision. Our proposal, LAVAD, leverages modality-aligned vision-language models (VLMs) to query and enhance the anomaly scores generated by large language models (LLMs).
  • Figure 2: Bar plot of the VAD performance (AUC-ROC) on the UCF-Crime test set, obtained by querying LLMs with textual descriptions of video frames from various captioning models. Different bars correspond to different variants of the captioning model BLIP-2 [li2023blip], while different colors indicate two different LLMs [touvron2023llama, jiang2023mistral]. For reference, we also plot the performance of the best-performing unsupervised method [thakare2023dyannet] as a red dashed line, and that of a random classifier as a gray dashed line.
  • Figure 3: The anomaly score predicted by Llama [touvron2023llama] over time for video Shooting033 from UCF-Crime. We highlight some sample frames with their associated BLIP-2 captions to demonstrate that the caption can be semantically noisy or incorrect (red bounding boxes are for abnormal predictions while blue bounding boxes are for normal predictions). Ground-truth anomalies are highlighted. In particular, the caption of the frame enclosed by a blue bounding box within the ground truth anomaly fails to accurately represent the visual content, leading to a wrong classification due to the low anomaly score given by the LLM.
  • Figure 4: The architecture of our proposed LAVAD for addressing training-free VAD. For each test video $\mathbf{V}$, we first employ a captioning model to generate a caption $C_i$ for each frame $\mathbf{I}_i \in \mathbf{V}$, forming a caption sequence $\mathbf{C}$. Our Image-Text Caption Cleaning component addresses noisy and incorrect raw captions based on cross-modal similarity. We replace the raw caption with a caption $\hat{C}_i \in \mathbf{C}$ whose textual embedding $\mathcal{E}_T(\hat{C}_i)$ is most aligned to the image embedding $\mathcal{E}_I(\mathbf{I}_i)$, resulting in a cleaned caption sequence $\hat{\mathbf{C}}$. To account for scene context and dynamics, our LLM-based Anomaly Scoring component further aggregates the cleaned captions within a temporal window centered around each $\mathbf{I}_i$ by prompting the LLM to produce a temporal summary $S_i$, forming a summary sequence $\mathbf{S}$. The LLM is then queried to provide an anomaly score for each frame based on its $S_i$, obtaining the initial anomaly scores $\mathbf{a}$ for all frames. Finally, our Video-Text Score Refinement component refines each $a_i$ by aggregating the initial anomaly scores of the frames whose summary text embeddings are most aligned to the representation $\mathcal{E}_V(\mathbf{V}_i)$ of the video snippet $\mathbf{V}_i$ centered around $\mathbf{I}_i$, leading to the final anomaly scores $\mathbf{\tilde{a}}$ for detecting the anomalies (anomalous frames are highlighted) within the video.
  • Figure 5: We showcase qualitative results obtained by LAVAD on four test videos, including two videos (top row) from UCF-Crime and two videos from XD-Violence (bottom row). For each video, we plot the anomaly score over frames computed by our method. We display some keyframes alongside their most aligned temporal summary (blue bounding boxes for normal frame predictions and red bounding boxes for abnormal frame predictions), illustrating the relevance among the predicted anomaly score, visual content, and description. Ground-truth anomalies are highlighted.
  • ...and 2 more figures
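
The caption-cleaning and score-refinement steps described in the Figure 4 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the image, caption, summary, and video-snippet embeddings have already been computed by modality-aligned encoders and are passed in as NumPy arrays with one row per frame. The function names, the top-k selection, the softmax weighting, and the `k` and `temperature` parameters are hypothetical choices made for illustration, since the caption only states that the scores of the most aligned frames are aggregated.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def clean_captions(image_emb, caption_emb):
    """Image-text caption cleaning (sketch).

    image_emb:   (M, D) array, one image embedding per frame.
    caption_emb: (M, D) array, one text embedding per raw caption.
    Returns, for every frame i, the index of the caption in the video's
    caption sequence whose embedding is most similar to frame i.
    """
    sim = l2_normalize(image_emb) @ l2_normalize(caption_emb).T  # (M, M)
    return sim.argmax(axis=1)


def refine_scores(snippet_emb, summary_emb, scores, k=2, temperature=0.1):
    """Video-text score refinement (sketch).

    snippet_emb: (M, D) array, embedding of the snippet centered on each frame.
    summary_emb: (M, D) array, text embedding of each temporal summary.
    scores:      (M,) initial anomaly scores produced by the LLM.
    ASSUMPTION: the refined score is a softmax-weighted average of the
    initial scores of the k summaries most similar to the snippet.
    """
    scores = np.asarray(scores, dtype=float)
    sim = l2_normalize(snippet_emb) @ l2_normalize(summary_emb).T  # (M, M)
    k = min(k, sim.shape[1])
    top = np.argsort(-sim, axis=1)[:, :k]              # indices of top-k summaries
    top_sim = np.take_along_axis(sim, top, axis=1)     # their similarities
    # Numerically stable softmax over the k selected similarities.
    weights = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * scores[top]).sum(axis=1)
```

With `k=1` the refinement reduces to copying the initial score of the single best-matching summary; larger `k` smooths each score over several semantically similar frames, and `temperature` controls how strongly the best matches dominate that average.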