VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun

TL;DR

The paper tackles video anomaly understanding (VAU) by introducing VAU-R1, a data-efficient reinforcement fine-tuning (RFT) framework that uses GRPO to enhance multimodal LLM reasoning across four VAU tasks: perception, grounding, reasoning, and conclusion. It pairs VAU-R1 with VAU-Bench, the first chain-of-thought benchmark for video anomaly reasoning, which provides rich annotations for QA, temporal localization, and reasoning rationales. Empirical results show that RFT improves QA accuracy, temporal grounding, and reasoning quality over supervised fine-tuning and generalizes better across datasets, though chain-of-thought prompts yield mixed effects on some tasks. The work offers a unified evaluation protocol and a scalable path toward interpretable, reasoning-aware VAU, with potential applications in safety-critical surveillance and disaster response.

Abstract

Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). We also propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Effectiveness of Reinforcement Fine-Tuning. We compare QA accuracy and temporal anomaly grounding performance across different models. VAU-R1, trained via Reinforcement Fine-Tuning (RFT), consistently outperforms its Supervised Fine-Tuning (SFT) counterpart. This demonstrates that RFT enhances both reasoning and temporal localization capabilities in VAU tasks.
  • Figure 2: Overview of VAU-R1. VAU-R1 leverages Reinforcement Fine-Tuning to enhance the reasoning ability of MLLMs for video anomaly understanding. Specifically, we adopt Group Relative Policy Optimization (GRPO) to optimize the model with task-specific rewards, such as answer format, accuracy, and temporal Intersection-over-Union (IoU). We decompose the VAU task into four complementary tasks to facilitate comprehensive reasoning: multiple-choice QA, temporal anomaly grounding, anomaly reasoning, and anomaly classification.
  • Figure 3: Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for the four VAU tasks.
  • Figure 4: Qualitative cases for the QA (top) and TAG (bottom) tasks. All ground truths and correct answers are highlighted in orange. Both SFT and RFT perform inference using the same CoT prompt. RFT’s explicit chain-of-thought yields a precise, interpretable QA choice and anomaly interval, whereas SFT’s output is less informative and tends to be inaccurate.
  • Figure 5: More dataset statistics of our VAU-Bench. (a) Distribution of training, validation, and test splits across the four tasks included in VAU-Bench. (b) Word cloud visualization of frequent terms appearing in the multiple-choice questions and choices.
  • ...and 4 more figures
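
The Figure 2 caption names three task-specific rewards (answer format, accuracy, and temporal IoU) that GRPO optimizes. The sketch below illustrates how such rewards and the group-relative advantage could be computed. It is not the authors' implementation: the `<think>`/`<answer>` output template, the function names, and the 0/1 reward values are assumptions made for illustration, and the advantage uses the common GRPO formulation of standardizing rewards within a group of responses sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# Assumed output template (hypothetical): reasoning in <think>, result in <answer>.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)


def temporal_iou(pred, gt):
    """Intersection-over-Union of two (start, end) intervals in seconds."""
    ps, pe = pred
    gs, ge = gt
    if pe <= ps or ge <= gs:
        return 0.0  # degenerate interval: no overlap credit
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union


def format_reward(response):
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0


def accuracy_reward(response, correct_choice):
    """1.0 if the letter inside <answer> equals the correct choice, else 0.0."""
    m = re.search(r"<answer>\s*([A-Z])\b", response)
    return 1.0 if m and m.group(1) == correct_choice else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score a group of sampled responses to one multiple-choice question.
group = [
    "<think>Smoke rises near the car.</think><answer>B</answer>",
    "<think>Nothing unusual.</think><answer>A</answer>",
    "The answer is B",  # violates the assumed template
]
rewards = [format_reward(r) + accuracy_reward(r, "B") for r in group]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group, a response is reinforced only when it scores above its siblings; if every sample receives the same reward, all advantages are zero and the group contributes no gradient signal. For the grounding task, `temporal_iou` would take the place of `accuracy_reward` in the summed reward.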