Text-guided Fine-Grained Video Anomaly Detection

Jihao Gu, Kun Li, He Wang, Kaan Akşit

TL;DR

TVAD addresses the need for fine-grained, interpretable VAD with two components: an Anomaly Heatmap Decoder that produces pixel-level heatmaps, and a Region-aware Anomaly Encoder that injects region- and motion-aware evidence into a frozen LVLM. Together they enable interactive, multi-turn reasoning about anomalies without manual thresholds. The approach demonstrates state-of-the-art micro-AUC on UBnormal and substantial improvements in anomaly localization and textual faithfulness on ShanghaiTech, with strong BLEU-4 and Yes/No accuracy for target appearance and motion. A curated frame-wise-to-timeline dataset pipeline supports training the RAE with aligned video-text supervision, strengthening cross-modal localization and description. Overall, TVAD delivers precise localization, robust interpretation, and practical interactivity for real-world VAD tasks.

Abstract

Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. Existing approaches are semi-automated and still require human assessment, and traditional VAD methods output only a binary normal-or-anomalous decision. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves state-of-the-art (SOTA) performance, with 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC) on the UBnormal dataset. Its textual descriptions are also subjectively verified as preferable on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
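
The abstract reports micro-AUC specifically. As commonly used in VAD evaluation, micro-AUC pools the frame-level scores of all test videos before computing a single ROC AUC, whereas macro-AUC averages per-video AUCs. The following is a minimal sketch of that distinction, not the paper's evaluation code; the function names (`roc_auc`, `micro_auc`, `macro_auc`) and the toy scores are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical container: video name -> (frame labels, frame anomaly scores).
VideoScores = Dict[str, Tuple[Sequence[int], Sequence[float]]]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the rank-sum (Mann-Whitney U) identity, ties averaged."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def micro_auc(videos: VideoScores) -> float:
    """Pool the frames of every video, then compute one AUC."""
    all_labels: List[int] = []
    all_scores: List[float] = []
    for labels, scores in videos.values():
        all_labels.extend(labels)
        all_scores.extend(scores)
    return roc_auc(all_labels, all_scores)


def macro_auc(videos: VideoScores) -> float:
    """Average per-video AUCs, skipping videos that contain a single class."""
    aucs = [
        roc_auc(labels, scores)
        for labels, scores in videos.values()
        if 0 < sum(labels) < len(labels)
    ]
    if not aucs:
        raise ValueError("no video contains both classes")
    return sum(aucs) / len(aucs)


# Toy data (invented for illustration): video "b" has a higher score offset.
toy = {
    "a": ([0, 0, 1], [0.10, 0.20, 0.90]),
    "b": ([0, 1, 1], [0.95, 0.97, 0.99]),
}
print(micro_auc(toy), macro_auc(toy))
```

On the toy data each video is ranked perfectly on its own, so macro-AUC is 1.0, while the pooled micro-AUC drops to 8/9 because a normal frame of video "b" outscores the anomalous frame of video "a". Micro-AUC therefore also rewards scores that are calibrated across videos.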

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The proposed TVAD model. The framework consists of three modules: a Text Encoder ($E_t$) that generates class-specific text embeddings $\mathbf{S}_c$ from binary prompts; an Anomaly Heatmap Decoder (AHD) that fuses $\mathbf{S}_c$ with visual features $\mathcal{V}$ to produce a spatiotemporal anomaly map $\mathbf{H}$; and a Region-aware Anomaly Encoder (RAE) that projects $\mathbf{H}_c$ into the LoRA-tuned LVLM semantic space and integrates it with a video $\mathcal{V}$ and a sequence of incrementally refined questions $\mathbf{Q}_{\leq t}$ to yield the final anomaly detection response $\mathbf{A}_t$.
  • Figure 2: Examples of interpretable anomaly detection and multi-turn QA across scenes. Each group shows the raw frame, the pixel-level anomaly heatmap produced by AHD, and TVAD’s dialogue outputs (anomaly yes/no, appearance/action details, and motion trajectory). Left: a cyclist (with umbrella and backpack) is localized as the anomalous target, with the trajectory "enter from right $\rightarrow$ turn toward the upper-right corner $\rightarrow$ exit." Right: a silver SUV suddenly appears from the left and moves rapidly; AHD highlights the vehicle consistently over time, and the QA module explains the abrupt appearance and fast motion. TVAD first detects the anomaly, then describes the appearance (white top, grey shorts, green schoolbag) and the change from walking to running. Red arrows indicate main motion directions; heatmap intensity reflects anomaly confidence. RAE encodes the heatmaps into region-aware text prompts that guide the LVLM to produce consistent decisions and descriptions, closing the loop from pixel-level evidence to readable narratives.
  • Figure 3: Trajectory visualization by accumulating frame-level outputs. The first row shows multi-frame overlays of the original video with green bounding boxes for GT and red bounding boxes for predictions. The second row overlays GT pixel-level masks to form fine-grained trajectories, while the third row overlays predicted pixel-level masks. Both bounding-box and pixel-level trajectories exhibit strong spatial alignment with GT, indicating that our model accurately captures motion paths across time.
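
The Figure 1 caption describes the AHD as fusing class-specific text embeddings $\mathbf{S}_c$ with visual features to produce a spatiotemporal anomaly map $\mathbf{H}$. The exact fusion is not given in this summary, so the sketch below assumes a generic CLIP-style scheme rather than the authors' decoder: cosine similarity between each pixel feature and two prompt embeddings (normal, anomalous), followed by a softmax over the two classes. The function name `anomaly_heatmap`, the tensor shapes, and the temperature of 0.07 are all assumptions for illustration.

```python
import numpy as np


def anomaly_heatmap(
    visual_feats: np.ndarray,
    text_embeds: np.ndarray,
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> np.ndarray:
    """Pixel-wise visual-textual alignment (illustrative, CLIP-style).

    visual_feats: (T, H, W, D) per-pixel features for T frames.
    text_embeds:  (2, D) embeddings, row 0 = "normal", row 1 = "anomalous".
    Returns a (T, H, W) map of anomaly probabilities in [0, 1].
    """
    eps = 1e-8
    # L2-normalize so that the dot product equals cosine similarity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    s = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    logits = (v @ s.T) / temperature  # (T, H, W, 2)
    # Numerically stable softmax over the two classes.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs[..., 1]


# Toy input: one 2x2 frame whose bottom-right pixel matches the anomalous prompt.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
feats = np.tile(np.array([1.0, 0.0]), (1, 2, 2, 1))
feats[0, 1, 1] = [0.0, 1.0]
heat = anomaly_heatmap(feats, text)
print(heat.shape)
```

In this toy case only the bottom-right pixel receives a high anomaly probability. A map of this shape is what a downstream module such as the RAE would consume; how the RAE projects it into the LVLM space is not covered by this sketch.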