Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model
Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, Yanning Zhang
TL;DR
The paper defines Video Anomaly Retrieval (VAR) to enable cross-modal search of long, untrimmed videos using detailed captions or synchronized audio. It introduces two large-scale benchmarks, UCFCrime-AR (video-text) and XDViolence-AR (video-audio), based on existing VAD data, and presents ALAN, a Transformer-based framework with anomaly-led sampling, a video-text-focused VPMPM pretext task, and dual cross-modal alignments. Experimental results show that ALAN achieves strong gains over state-of-the-art retrieval methods on both benchmarks, highlighting the usefulness of focusing on anomalous segments and fine-grained cross-modal semantics. The work advances practical video understanding by bridging VAD with retrieval demands, offering datasets and a model that support scalable, cross-modal anomaly search in real-world surveillance and media analysis scenarios.
Abstract
Video anomaly detection (VAD) has received increasing attention because of its potential applications. Its currently dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which builds relationships between complicated anomalous events and single labels, e.g., "vandalism", is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research focuses on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos across modalities, e.g., language descriptions and synchronous audio. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and of short duration, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between fine-grained video-text representations. Besides, we leverage two complementary alignments to further match cross-modal contents. Experimental results on the two benchmarks reveal the challenges of the VAR task and also demonstrate the advantages of our tailored method. Captions are publicly released at https://github.com/Roc-Ng/VAR.
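To make the two ideas in the abstract more concrete (focusing on key segments of a long untrimmed video, then matching a query against videos across modalities), the following is a minimal illustrative sketch and not ALAN itself. It assumes that per-segment anomaly scores are already available from some detector, replaces the paper's sampling with a simple deterministic top-k selection, and ranks videos by plain cosine similarity between precomputed embeddings. The function names `anomaly_led_select`, `cosine`, and `rank_videos` are hypothetical and do not come from the paper or its repository.

```python
import math


def anomaly_led_select(scores, k):
    """Return indices of the k highest-scoring segments, in temporal order.

    Assumption: `scores` holds one anomaly score per video segment.
    Top-k selection is a simplified stand-in for anomaly-led sampling.
    """
    if k <= 0:
        return []
    k = min(k, len(scores))
    # Rank by descending score; ties go to the earlier segment.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore temporal order so downstream encoders see a coherent sequence.
    return sorted(ranked[:k])


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_videos(query_vec, video_vecs):
    """Return video indices sorted from most to least similar to the query."""
    sims = [cosine(query_vec, v) for v in video_vecs]
    return sorted(range(len(video_vecs)), key=lambda i: (-sims[i], i))


if __name__ == "__main__":
    # Hypothetical anomaly scores for five segments of one untrimmed video.
    print(anomaly_led_select([0.1, 0.9, 0.2, 0.8, 0.05], k=2))
    # Hypothetical 2-D embeddings: one text (or audio) query, three videos.
    print(rank_videos([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```

In this sketch the selected segments are returned in temporal order, so a sequence model applied afterwards would still see events in the order they occurred. A learned model would replace both the fixed scores and the raw cosine ranking with trained components.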
