Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

TL;DR

Pistachio presents a fully synthetic, balanced, long-form benchmark for video anomaly detection and understanding, built via a controllable generation pipeline that yields 41-second narratives with rich scene and anomaly diversity. The approach combines scene-conditioned anomaly assignment, multi-step storyline generation, temporally coherent long-form synthesis, and hybrid filtering to produce scalable ground truth with minimal human effort, including VAU annotations generated automatically from storyline descriptions. Empirical results show Pistachio poses new challenges for existing VAD/VAU methods, with vision-language and large-language-model–based approaches offering the strongest generalization, while also highlighting limitations of current architectures in long-horizon reasoning. The work also provides a reusable data-generation toolkit and prompts for researchers to create customized benchmarks, aiming to accelerate progress in both VAD and VAU, and to enable open-ended, semantically rich anomaly understanding in realistic settings.

Abstract

Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

Paper Structure

This paper contains 32 sections, 3 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: We introduce Pistachio, a benchmark for video anomaly analysis that targets two fundamental tasks: Video Anomaly Detection (VAD) and Video Anomaly Understanding (VAU). The VAD dataset totals 1.6 million frames and extends existing datasets by expanding the number of scenes from hundreds to thousands, covering 31 distinct anomaly types, over half of which are unique to this benchmark. Pistachio offers multi-granularity annotations at both the event and video levels. The entire benchmark was produced via a highly automated pipeline.
  • Figure 2: Overview of our video anomaly dataset generation pipeline. Step 1 (Storyline Generation): A VLM-based scene classifier processes input images to identify scenes. The Anomaly Type Allocator then assigns appropriate anomaly types (e.g., Fire), and LLMs generate coherent storylines across six scenes. Step 2 (Event-to-Video Summary): The storyline is decomposed into event summaries, with each event described by a detailed prompt specifying camera angles, actions, and temporal progression (Prompts 1-7). Step 3 (Storyline-to-Video): An image-to-video generation model synthesizes coherent video sequences from the event summaries, producing temporally consistent frames (frames 1-567) that maintain narrative continuity across the entire storyline.
  • Figure 3: Distribution of anomaly videos across different anomaly types. For each type, the left bar represents short videos and the right bar represents long videos. The rightmost single bars indicate multi-anomaly videos, which are exclusively long-form.
  • Figure 4: Distribution of anomaly video ratios in each scenario.
  • Figure 5: Comparison of Different Video Generation Schemes and Non-compliant Videos.
  • ...and 1 more figures
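
The three steps in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scene-to-anomaly table, the filename-based classifier stub, the prompt wording, and every function and class name are hypothetical. Only the counts (7 prompts, frames 1-567) come from the caption, and the even split of frames across prompts is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical scene -> anomaly table; the real allocator is not described here.
SCENE_ANOMALIES: Dict[str, List[str]] = {
    "kitchen": ["Fire"],
    "street": ["Traffic accident"],
}

NUM_PROMPTS = 7     # "Prompts 1-7" in the Figure 2 caption
TOTAL_FRAMES = 567  # "frames 1-567" in the Figure 2 caption


@dataclass
class Event:
    index: int
    prompt: str


@dataclass
class Segment:
    event: Event
    first_frame: int
    last_frame: int


def classify_scene(image_path: str) -> str:
    """Stand-in for the VLM scene classifier: matches a keyword in the filename."""
    for scene in SCENE_ANOMALIES:
        if scene in image_path:
            return scene
    return "unknown"


def allocate_anomaly(scene: str) -> str:
    """Scene-conditioned assignment: pick an anomaly type that fits the scene."""
    candidates = SCENE_ANOMALIES.get(scene, [])
    return candidates[0] if candidates else "Normal"


def build_events(scene: str, anomaly: str, n: int = NUM_PROMPTS) -> List[Event]:
    """Stand-in for the LLM step that decomposes a storyline into event prompts."""
    return [
        Event(i, f"[{scene}] event {i}/{n}: {anomaly} storyline, fixed camera")
        for i in range(1, n + 1)
    ]


def plan_segments(events: List[Event], total_frames: int = TOTAL_FRAMES) -> List[Segment]:
    """Lay the events end to end, assuming an even split of frames per event."""
    per_event = total_frames // len(events)
    segments, start = [], 1
    for event in events:
        end = start + per_event - 1
        segments.append(Segment(event, start, end))
        start = end + 1
    return segments


scene = classify_scene("kitchen_01.jpg")
anomaly = allocate_anomaly(scene)
segments = plan_segments(build_events(scene, anomaly))
for seg in segments:
    print(seg.event.index, seg.first_frame, seg.last_frame)
```

Under the even-split assumption each prompt covers 81 frames, so the seven segments tile frames 1-567 exactly. A real pipeline would replace `plan_segments` with calls to an image-to-video model.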