Text-guided Fine-Grained Video Anomaly Detection
Jihao Gu, Kun Li, He Wang, Kaan Akşit
TL;DR
T-VAD addresses the need for fine-grained, interpretable VAD with two components: an Anomaly Heatmap Decoder that produces pixel-level heatmaps, and a Region-aware Anomaly Encoder (RAE) that injects region- and motion-aware evidence into a frozen LVLM. Together they enable interactive, multi-turn reasoning about anomalies without manual thresholds. The approach achieves state-of-the-art micro-AUC on UBnormal and substantial improvements in anomaly localization and textual faithfulness on ShanghaiTech, with strong BLEU-4 and Yes/No accuracy for target appearance and motion. A curated frame-wise to timeline dataset pipeline supports training the RAE with aligned video-text supervision, strengthening cross-modal localization and description. Overall, T-VAD delivers precise localization, robust interpretation, and practical interactivity for real-world VAD tasks.
Abstract
Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. Existing approaches are semi-automated and still require human assessment, and traditional VAD methods offer only a binary output: normal or anomalous. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance on the UBnormal dataset, with 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC). It also produces textual descriptions that subjective evaluation finds preferable, both on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
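Because the headline number is reported as micro-AUC, it may help to see how that differs from a per-video average. The sketch below is not the authors' evaluation code; it assumes the convention common in VAD benchmarks, where micro-AUC pools the frame-level scores of all test videos into one ROC computation, while macro-AUC averages per-video AUCs. The function names and the toy scores are hypothetical, chosen only for illustration.

```python
def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) statistic; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one anomalous frame")
    # Fraction of (anomalous, normal) frame pairs ranked in the right order.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_auc(videos):
    """Pool the frames of every video, then compute a single AUC."""
    all_scores = [s for scores, _ in videos for s in scores]
    all_labels = [y for _, labels in videos for y in labels]
    return auc(all_scores, all_labels)


def macro_auc(videos):
    """Average per-video AUCs, skipping videos that contain only one class."""
    per_video = [auc(s, y) for s, y in videos if 0 < sum(y) < len(y)]
    return sum(per_video) / len(per_video)


# Toy data (hypothetical): each video is (frame scores, frame labels), 1 = anomalous.
videos = [
    ([0.10, 0.20, 0.90], [0, 0, 1]),  # low-scoring video, perfectly ranked
    ([0.92, 0.93, 0.99], [0, 0, 1]),  # high-scoring video, perfectly ranked
]
micro = micro_auc(videos)
macro = macro_auc(videos)
```

In this toy case every video is ranked perfectly on its own, so the macro value is 1.0, but the normal frames of the second video outscore the anomalous frame of the first, which pulls the pooled micro value down to 0.75. Micro-AUC therefore also rewards scores that are calibrated consistently across videos.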
