CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World

Yating Yu, Congqi Cao, Zhaoying Wang, Weihua Meng, Jie Li, Yuxin Li, Zihao Wei, Zhongpei Shen, Jiajun Zhang

TL;DR

This work introduces CueBench, the first large-scale benchmark for context-aware video anomaly understanding (VAU) in real-world settings, featuring a five-task unified evaluation and a rich event-centric taxonomy of absolute and conditional anomalies across 174 scenes and 198 attributes. It couples CueBench with Cue-R1, a unified generative model trained via supervised and reinforcement fine-tuning using verifiable, task-aligned rewards within a GRPO framework, enabling structured, hierarchical, and temporally aware VAU reasoning. Across extensive experiments, Cue-R1 consistently outperforms both generative and specialized vision-language models, revealing substantial gaps in current VAU capabilities while demonstrating the practicality and effectiveness of a unified, context-aware approach. The work advances VAU by providing a rigorous benchmark and a robust, interpretable method capable of open-world anomaly understanding, with implications for safer and more intelligent real-world vision systems.

Abstract

How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited coverage of the complex principles and subtle context that distinguish anomalies from normalities, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first-of-its-kind Benchmark devoted to Context-aware video anomalies within a Unified Evaluation framework. We comprehensively establish an event-centric hierarchical taxonomy that anchors two core event types: 14 conditional and 18 absolute anomaly events, defined by their refined semantics from diverse contexts across 174 scenes and 198 attributes. Based on this, we propose to unify and benchmark context-aware VAU with various challenging tasks across recognition, temporal grounding, detection, and anticipation. This also serves as a rigorous and fair probing evaluation suite for generative-discriminative as well as generalized-specialized vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1 based on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.
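As a rough illustration of the R1-style reinforcement fine-tuning mentioned above, the sketch below shows the group-relative advantage step commonly used in GRPO: several responses are sampled for one prompt, each receives a scalar verifiable reward, and each reward is normalized against its own group's mean and standard deviation. This is a generic sketch, not Cue-R1's actual implementation; the function name, the `eps` constant, and the example reward values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. `eps` (an assumed constant) avoids division by zero when all
    responses in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed to supply a baseline.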

Paper Structure

This paper contains 42 sections, 8 equations, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: Comparison of existing benchmarks. (a) Traditional VAD aims to detect deviations from normal patterns and identify the time window of the occurring anomaly, yet exhibits insufficient comprehension of subtle anomalies and lacks context-awareness (e.g., a cyclist jaywalking while crossing the road). (b) Current VAU benchmarks primarily emphasize the interpretation of absolutely anomalous events with explainable outputs. (c) Our large-scale CueBench features a diverse collection of context-aware anomalies and normalities from real-world scenarios, organized within a comprehensive hierarchical taxonomy, and supports unified evaluation across five challenging VAU tasks.
  • Figure 2: Data statistics of CueBench. (a) We comprehensively build an event-centric hierarchical taxonomy that covers 2 states, 3 domains, and 9 effects in a top-down manner. (b) CueBench exhibits a diverse spectrum of conditional anomalies, normalities, and absolute anomalies across both the training and test splits.
  • Figure 3: Evaluation framework with task examples of CueBench. Our benchmark advances the evaluation of five challenging context-aware VAU tasks in a unified generative manner, by prompting the generative VLMs with videos and task-related questions. The VLMs are required to respond in a JSON-style format rather than free text. This enables accurate evaluation of various tasks for generative VLMs by checking the answers against the ground truth.
  • Figure 4: Case Study. Comparison between Qwen2.5-VL-3B and Cue-R1 on context-aware anomaly recognition and detection.
  • Figure 5: Word cloud of the contexts in CueBench.
  • ...and 7 more figures
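
The Figure 3 caption describes checking JSON-style answers against ground truth. A minimal sketch of that idea follows, assuming a hypothetical `"answer"` key and case-insensitive exact matching; the actual CueBench schema and per-task scoring rules are not given here, so every name in the block is illustrative only.

```python
import json
import re


def parse_json_answer(response: str):
    """Return the first-to-last brace span of a response parsed as JSON, or None."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def is_correct(response: str, ground_truth: str, key: str = "answer") -> bool:
    """Exact-match check on one field; `key` is a hypothetical field name."""
    parsed = parse_json_answer(response)
    if parsed is None or key not in parsed:
        return False
    return str(parsed[key]).strip().lower() == ground_truth.strip().lower()


# Hypothetical model output wrapped in extra text.
print(is_correct('Result: {"answer": "Climbing without gear"}', "climbing without gear"))
```

A response that contains no parseable JSON object is simply scored as incorrect, which is what makes a structured format easier to grade than free text.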