Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models

Yuchen Yang; Kwonjoon Lee; Behzad Dariush; Yinzhi Cao; Shao-Yuan Lo

TL;DR

This work addresses the lack of explainability in video anomaly detection (VAD) by introducing AnomalyRuler, a rule-based reasoning framework. Using few-normal-shot prompting, it induces a set of anomaly-detection rules from a handful of normal reference samples and then applies those rules deductively to test videos. A vision-language model produces the frame descriptions, and LLMs with carefully designed prompts carry out the reasoning. The approach achieves state-of-the-art detection performance and demonstrable reasoning capability across four benchmarks, with strong domain adaptability. Because it needs no full-shot training, it adapts quickly to new VAD scenarios while providing interpretable rules and a reasoning trace. AnomalyRuler is open source and is shown to outperform several LLM-based baselines, offering a practical path toward trustworthy, explainable VAD in real-world deployments.

Abstract

Video Anomaly Detection (VAD) is crucial for applications such as security surveillance and autonomous driving. However, existing VAD methods provide little rationale behind their detections, hindering public trust in real-world deployments. In this paper, we approach VAD with a reasoning framework. Although Large Language Models (LLMs) have shown revolutionary reasoning ability, we find that using them directly falls short for VAD. Specifically, the implicit knowledge pre-trained in LLMs focuses on general context and thus may not apply to every specific real-world VAD scenario, leading to inflexibility and inaccuracy. To address this, we propose AnomalyRuler, a novel rule-based reasoning framework for VAD with LLMs. AnomalyRuler comprises two main stages: induction and deduction. In the induction stage, the LLM is given few-shot normal reference samples and summarizes these normal patterns to induce a set of rules for detecting anomalies. The deduction stage follows the induced rules to spot anomalous frames in test videos. Additionally, we design rule aggregation, perception smoothing, and robust reasoning strategies to further enhance AnomalyRuler's robustness. AnomalyRuler is the first reasoning approach for the one-class VAD task. It requires only few-normal-shot prompting, without full-shot training, thereby enabling fast adaptation to various VAD scenarios. Comprehensive experiments across four VAD benchmarks demonstrate AnomalyRuler's state-of-the-art detection performance and reasoning ability. AnomalyRuler is open-source and available at: https://github.com/Yuchen413/AnomalyRuler
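The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the released implementation: the callables `describe`, `induce_rules`, and `judge` are hypothetical stand-ins for the vision-language model and the LLM prompts, and the toy data mirrors the "walking is normal, skateboarding is not" example.

```python
from typing import Callable, Iterable, List

# Hypothetical interfaces (assumptions, not the authors' API). In practice
# these would wrap a vision-language model and an LLM; plain callables keep
# the sketch runnable offline.
Describe = Callable[[str], str]                  # frame id -> text description
InduceRules = Callable[[List[str]], List[str]]   # normal descriptions -> rules
Judge = Callable[[List[str], str], bool]         # (rules, description) -> is anomaly


def induce(normal_frames: Iterable[str], describe: Describe,
           induce_rules: InduceRules) -> List[str]:
    """Induction: describe the few normal reference frames, then derive rules."""
    descriptions = [describe(frame) for frame in normal_frames]
    return induce_rules(descriptions)


def deduce(test_frames: Iterable[str], rules: List[str], describe: Describe,
           judge: Judge) -> List[bool]:
    """Deduction: check each test frame's description against the rules."""
    return [judge(rules, describe(frame)) for frame in test_frames]


if __name__ == "__main__":
    # Toy captions standing in for model output.
    captions = {
        "n1": "a person walking",
        "n2": "two people walking",
        "t1": "a person walking",
        "t2": "a person skateboarding",
    }
    rules = induce(["n1", "n2"], captions.__getitem__,
                   lambda descs: ["normal: walking"])
    flags = deduce(["t1", "t2"], rules, captions.__getitem__,
                   lambda rs, desc: "walking" not in desc)
    print(rules, flags)
```

Keeping the models behind plain callables is a design choice of this sketch only; it makes the induction and deduction steps testable without any model access.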

Paper Structure (45 sections, 1 equation, 3 figures, 11 tables)

Introduction
Related Work
Video Anomaly Detection.
Large Language Models.
Induction
Visual Perception
Rule Generation
Normal and Anomaly.
Abstract and Concrete.
Human and Environment.
Rule Aggregation
Deduction
Perception Smoothing
Initial Anomaly Matching.
Exponential Majority Smoothing.
...and 30 more sections

Figures (3)

Figure 1: Comparison of one-class VAD approaches. In this specific safety application example, only "walking" is normal. The test frame contains "skateboarding", so it is abnormal. (a) Traditional methods require full-shot training and only output anomaly scores, lacking reasoning. (b) Direct LLM use may not align with specific VAD needs. Here GPT-4V mistakenly treats "skateboarding" as normal. (c) Our AnomalyRuler has induction and deduction stages. It derives rules from few-shot normal reference frames to detect anomalies, correctly identifying "skateboarding" as an anomaly.
Figure 2: The AnomalyRuler pipeline consists of two main stages: induction and deduction. The induction stage involves: i) visual perception transfers normal reference frames to text descriptions; ii) rule generation derives rules based on these descriptions to determine normality and anomaly; iii) rule aggregation employs a voting mechanism to mitigate errors in rules. The deduction stage involves: i) visual perception transfers continuous frames to descriptions; ii) perception smoothing adjusts these descriptions considering temporal consistency to ensure neighboring frames share similar characteristics; iii) robust reasoning rechecks the previous dummy answers and outputs reasoning.
Figure 3: Ablation on hyperparameters of the (a, b) rule aggregation and (c) perception smoothing modules on the ShT dataset.
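The captions above mention a voting mechanism for rule aggregation and a temporal smoothing step in deduction. The sketch below shows one simple way such steps could look. It is an assumption-laden illustration rather than the paper's exact formulation: `aggregate_rules`, `smooth_flags`, and the parameters `min_votes`, `alpha`, and `threshold` are hypothetical names, and the smoothing here is a plain exponential moving average followed by re-thresholding.

```python
from collections import Counter
from typing import List, Sequence


def aggregate_rules(candidate_sets: Sequence[Sequence[str]],
                    min_votes: int) -> List[str]:
    """Keep rules proposed in at least `min_votes` of the candidate rule sets."""
    # Count each rule at most once per candidate set.
    votes = Counter(rule for rules in candidate_sets for rule in set(rules))
    seen, kept = set(), []
    # Walk the sets in order so the result keeps first-seen ordering.
    for rules in candidate_sets:
        for rule in rules:
            if rule not in seen and votes[rule] >= min_votes:
                kept.append(rule)
            seen.add(rule)
    return kept


def smooth_flags(flags: Sequence[int], alpha: float = 0.5,
                 threshold: float = 0.5) -> List[int]:
    """Smooth per-frame 0/1 anomaly flags with an exponential moving average."""
    if not flags:
        return []
    ema = float(flags[0])  # initialise with the first observation
    smoothed = []
    for flag in flags:
        ema = alpha * flag + (1 - alpha) * ema
        smoothed.append(1 if ema >= threshold else 0)
    return smoothed
```

With a small `alpha`, an isolated one-frame flag is suppressed while a sustained run of flags survives, which is the temporal-consistency behavior the Figure 2 caption describes.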
