VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

Ying Cheng, Yu-Ho Lin, Min-Hung Chen, Fu-En Yang, Shang-Hong Lai

TL;DR

VADER addresses semantic interpretability in video anomaly detection by proposing Video Anomaly Understanding (VAU) and fusing visual cues with relational signals in an LLM-driven framework that generates causal anomaly narratives. The method combines an Anomaly Scorer, which assigns per-frame scores, with a Context-AwarE Sampling (CAES) strategy that captures causal context. A DETR-based scene-graph relation feature extractor and the COntrastive Relation Encoder (CORE) then map relational changes into compact tokens. These tokens are integrated into a pretrained multimodal LLM, with fine-tuning limited to projection layers and LoRA adapters, enabling causally grounded descriptions and anomaly-related question answering. Empirical evaluation on the HIVAU-70k, HAWK, and CUVA benchmarks shows strong performance in description, explanation, and causal reasoning, and ablations indicate that CORE and CAES are essential. Limitations include reliance on upstream modules and a bias toward high-motion events.

Abstract

Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
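
The scoring-then-sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the threshold values, the per-segment frame counts, the uniform sampling rule, and the function names `segment_event`, `sample_uniform`, and `context_aware_sample` are all assumptions made for illustration. Only the split into pre-event, on-event, and post-event intervals and the use of rise and calm thresholds are taken from the Figure 3 caption.

```python
from typing import Dict, List, Optional


def segment_event(scores: List[float], rise: float, calm: float) -> Optional[Dict[str, range]]:
    """Split frame indices into pre-, on-, and post-event intervals.

    Assumption: the event starts at the first frame whose score reaches
    `rise` and ends at the first later frame whose score falls to `calm`
    or below. Returns None if no frame reaches `rise`.
    """
    start = next((i for i, s in enumerate(scores) if s >= rise), None)
    if start is None:
        return None
    end = next(
        (i for i in range(start + 1, len(scores)) if scores[i] <= calm),
        len(scores),
    )
    return {
        "pre": range(0, start),
        "on": range(start, end),
        "post": range(end, len(scores)),
    }


def sample_uniform(interval: range, k: int) -> List[int]:
    """Pick up to k roughly evenly spaced frame indices from an interval."""
    n = len(interval)
    if n == 0 or k <= 0:
        return []
    if k >= n:
        return list(interval)
    # Take the midpoint of each of k equal-width bins.
    return [interval[(2 * j + 1) * n // (2 * k)] for j in range(k)]


def context_aware_sample(
    scores: List[float],
    rise: float = 0.6,   # hypothetical threshold
    calm: float = 0.3,   # hypothetical threshold
    k_pre: int = 2,      # hypothetical frame budget per segment
    k_on: int = 4,
    k_post: int = 2,
) -> List[int]:
    """Sample keyframes from before, during, and after the anomalous event."""
    segments = segment_event(scores, rise, calm)
    if segments is None:
        # No event found: fall back to uniform sampling over the whole clip.
        return sample_uniform(range(len(scores)), k_pre + k_on + k_post)
    return (
        sample_uniform(segments["pre"], k_pre)
        + sample_uniform(segments["on"], k_on)
        + sample_uniform(segments["post"], k_post)
    )


if __name__ == "__main__":
    demo_scores = [0.1, 0.1, 0.2, 0.7, 0.9, 0.8, 0.5, 0.2, 0.1, 0.1]
    print(context_aware_sample(demo_scores))
```

Sampling from all three intervals, rather than only from the high-score frames, is what keeps the lead-up and the aftermath of an event available to the downstream model. The actual selection rule used by CAES may differ from this sketch.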

Paper Structure

This paper contains 30 sections, 5 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Concept of VADER. VADER enables detailed anomaly understanding by extracting key visual and relational cues from selected keyframes.
  • Figure 2: Overview of the VADER framework. Given an input video, the Anomaly Scorer and Context-AwarE Sampling (CAES) identify keyframes for narrative-driven anomaly analysis. Visual and relational features are extracted and encoded, with dynamic relational patterns distilled by the COntrastive Relation Encoder (CORE). All cues are fused by a pretrained LLM for comprehensive video anomaly understanding. The right panel illustrates the relational branch, including temporal association, volatility mining, and contrastive token learning.
  • Figure 3: Illustration of the CAES keyframe selection strategy. The anomaly score curve is segmented into pre-event (blue), on-event (yellow), and post-event (green) intervals. Blue dots are sampled keyframes, and red dots indicate the rise and calm thresholds.
  • Figure 4: Qualitative results of VADER's causal reasoning capabilities. Given a video depicting a nighttime break-in, VADER generates detailed, context-aware answers for a sequence of causal reasoning queries, capturing both the key actions and underlying cause-effect chains within the event.
  • Figure 5: Three examples of the anomalous video description task. The descriptions generated by Otter [li2025otter] and Video-ChatGPT [Maaz2023VideoChatGPT] contain hallucinations or incorrect analysis. In contrast, VADER produces concise, contextually grounded, and causally coherent descriptions that accurately reflect the events and their underlying dynamics across various challenging cases.
  • ...and 4 more figures