Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model

Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, Yanning Zhang

TL;DR

The paper defines Video Anomaly Retrieval (VAR) to enable cross-modal search of long, untrimmed videos using detailed captions or synchronized audio. It introduces two large-scale benchmarks, UCFCrime-AR (video-text) and XDViolence-AR (video-audio), based on existing VAD data, and presents ALAN, a Transformer-based framework with anomaly-led sampling, a video-text-focused VPMPM pretext task, and dual cross-modal alignments. Experimental results show that ALAN achieves strong gains over state-of-the-art retrieval methods on both benchmarks, highlighting the usefulness of focusing on anomalous segments and fine-grained cross-modal semantics. The work advances practical video understanding by bridging VAD with retrieval demands, offering datasets and a model that support scalable, cross-modal anomaly search in real-world surveillance and media analysis scenarios.

Abstract

Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its currently dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which builds relationships between complicated anomalous events and single labels, e.g., "vandalism", is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research focuses on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos across modalities, e.g., language descriptions and synchronous audio. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and short, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between fine-grained video and text representations. In addition, we leverage two complementary alignments to further match cross-modal contents. Experimental results on the two benchmarks reveal the challenges of the VAR task and also demonstrate the advantages of our tailored method. Captions are publicly released at https://github.com/Roc-Ng/VAR.
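The abstract names anomaly-led sampling but does not spell out its mechanics here. The sketch below shows one plausible reading, not the authors' implementation. It assumes per-segment anomaly scores are already available (for example, from a pretrained VAD model) and picks segment indices at evenly spaced quantiles of the cumulative score, so high-scoring regions are sampled more densely. The function name `anomaly_led_sample`, the quantile scheme, and the uniform fallback are all illustrative assumptions.

```python
import numpy as np


def anomaly_led_sample(scores, num_samples):
    """Pick `num_samples` segment indices, denser where anomaly scores are high.

    Hypothetical helper: `scores` holds one non-negative anomaly score per
    video segment (assumed to come from some upstream VAD model).
    """
    scores = np.asarray(scores, dtype=np.float64)
    if scores.ndim != 1 or scores.size == 0:
        raise ValueError("scores must be a non-empty 1-D sequence")
    if np.any(scores < 0):
        raise ValueError("scores must be non-negative")

    total = scores.sum()
    if total == 0:
        # No anomaly evidence at all: fall back to uniform temporal sampling.
        return np.linspace(0, scores.size - 1, num_samples).round().astype(int)

    # Normalized cumulative score acts as a CDF over the timeline.
    cdf = np.cumsum(scores) / total
    # Evenly spaced quantiles (bin midpoints) strictly inside (0, 1).
    quantiles = (np.arange(num_samples) + 0.5) / num_samples
    # First segment whose cumulative mass reaches each quantile.
    idx = np.searchsorted(cdf, quantiles, side="left")
    return np.minimum(idx, scores.size - 1)


if __name__ == "__main__":
    # Eight segments; only segments 4 and 5 look anomalous.
    print(anomaly_led_sample([0, 0, 0, 0, 1, 1, 0, 0], 4))
```

With these toy scores every sampled index falls on segment 4 or 5, and the indices stay in temporal order. A segment may be selected more than once; whether to deduplicate or mix in uniformly sampled segments is a design choice this sketch leaves open.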

Paper Structure (21 sections, 10 equations, 10 figures, 10 tables, 1 algorithm)

Figures (10)

  • Figure 1: VAD vs. VAR. Single labels may be unable to describe sequential anomalous events in VAD, but text captions or synchronous audio can sufficiently depict events in VAR.
  • Figure 2: Comparison of VAR with video retrieval and video moment retrieval.
  • Figure 3: Statistical histogram distributions on UCFCrime-AR. Left: text captions in English; Right: text captions in Chinese.
  • Figure 4: Overview of our ALAN. It consists of several components, i.e., video encoder, text encoder, audio encoder, pretext task VPMPM, and cross-modal alignment.
  • Figure 5: Influences of $\alpha$ on both UCFCrime-AR and XDViolence-AR.
  • ...and 5 more figures