Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs

Hari Lee

TL;DR

TbVAD tackles video anomaly detection under weak supervision by shifting the entire reasoning pipeline to textual representations. It introduces a three-branch architecture: a Structured Knowledge Branch that creates multi-aspect textual priors from captions, a Text Understanding Branch that encodes fine-grained captions, and an Explainable Reasoning Branch that yields slot-wise importance, retrieved evidence, and natural-language explanations. Experiments on UCF-Crime and XD-Violence show competitive performance relative to vision-based baselines while offering enhanced interpretability through knowledge-grounded explanations. Ablation and cross-dataset analyses reveal the value of four semantic slots (context, action, object, environment) for robust, generalizable anomaly reasoning in real-world surveillance scenarios.

Abstract

We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection (WSVAD) that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of captions generated by various VLMs. Molmo demonstrates the most descriptive and domain-appropriate outputs for surveillance imagery.
  • Figure 2: Overview of knowledge generation. Our pipeline first samples $K$ evenly spaced frames from a given video, and uses a frozen vision-language model (VLM) to generate fine-grained captions. These frame-level descriptions are aggregated across the video to form comprehensive textual summaries, grouped by video class: $D_n$ for normal videos and $D_a$ for abnormal ones. Using a large language model (LLM), we perform multi-aspect summarization with four prompts—$P_c$, $P_a$, $P_o$, and $P_e$—designed to extract context, action, object, and environmental information, respectively. The resulting structured knowledge is represented as $K_n$ and $K_a$, which encode multi-dimensional textual features of normal and abnormal scenarios. This process transforms raw visual content into interpretable and structured semantic representations for downstream anomaly detection.
  • Figure 3: Overview of the proposed TbVAD architecture. The framework consists of three main branches: (1) a Text Understanding Branch that encodes frame-level fine-grained captions using a trainable transformer encoder, (2) a Structured Knowledge Branch that constructs multi-aspect textual priors (context, action, object, environment) from video captions via LLM-based summarization, and (3) an Explainable Reasoning Branch that computes slot-wise importance through attention-based alignment and generates human-interpretable textual explanations. The representations from the first two branches are fused through a classification head to predict the anomaly score and label, while the reasoning branch provides slot-grounded evidence supporting each prediction.
  • Figure 4: Visualization of TbVAD’s knowledge-grounded explanations for three representative abnormal events: (a) Explosion, (b) Riot, and (c) Car Accident.
  • Figure 5: Visualization of slot-wise importance and counterfactual reasoning in TbVAD. (a–c) show slot importance ($w_s$) for Explosion, Riot, and Car Accident samples, respectively. (d–f) visualize the corresponding counterfactual slot margins ($\Delta_s = w_s - w^{cf}_s$), illustrating how each slot’s contribution changes when the prediction is inverted. Blue bars denote the top-2 influential slots; gray bars represent minor ones.
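
Two quantities named in the captions above are simple enough to sketch in code: the $K$ evenly spaced frame indices from Figure 2, and the counterfactual slot margin $\Delta_s = w_s - w^{cf}_s$ with its top-2 slot selection from Figure 5. The following is a minimal illustration, not the authors' implementation. The function names, the segment-centre sampling rule, and the example weights are all assumptions made for this sketch; the captions specify only "evenly spaced" frames and the margin formula.

```python
from typing import Dict, List

# The four semantic slots named in the paper.
SLOTS = ("context", "action", "object", "environment")


def sample_frame_indices(num_frames: int, k: int) -> List[int]:
    """Return k evenly spaced frame indices for a video of num_frames frames.

    Assumption: each index is taken at the centre of one of k equal-width
    segments. The source only states that the frames are evenly spaced.
    """
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot sample more frames than exist
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def slot_margins(w: Dict[str, float], w_cf: Dict[str, float]) -> Dict[str, float]:
    """Counterfactual margin per slot: delta_s = w_s - w_s^cf."""
    return {s: w[s] - w_cf[s] for s in SLOTS}


def top_slots(w: Dict[str, float], n: int = 2) -> List[str]:
    """Names of the n slots with the largest importance weights."""
    return sorted(w, key=w.get, reverse=True)[:n]


# Hypothetical slot weights for one abnormal sample (illustrative values only).
w = {"context": 0.1, "action": 0.5, "object": 0.3, "environment": 0.1}
# Hypothetical weights after the prediction is inverted.
w_cf = {"context": 0.4, "action": 0.1, "object": 0.2, "environment": 0.3}

indices = sample_frame_indices(100, 4)
margins = slot_margins(w, w_cf)
influential = top_slots(w)
```

With these made-up weights, a positive margin marks a slot that supports the original prediction more strongly than the inverted one, which is how the bars in panels (d–f) would be read.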