Context-aware Video Anomaly Detection in Long-Term Datasets

Zhengye Yang, Richard Radke

TL;DR

This paper tackles video anomaly detection (VAD) in long-term, context-rich environments by introducing Trinity, a context-aware VAD framework that learns joint embeddings across context, appearance, and motion through global and local contrastive alignments. Trinity combines a context branch, a motion branch with vector-quantized flow tokens, and an appearance branch based on a U-Net; it detects context-dependent anomalies and still handles context-free anomalies on standard benchmarks. The authors contribute a new long-term dataset, WF, with rich contextual metadata, plus a Biker Day variant for benchmark-style evaluation, and show that context-aware global alignment substantially improves anomaly detection, especially for out-of-context events. The approach is practical for real-world camera networks because it flags behavior that is anomalous only within a given temporal or scheduled context, and it provides a foundation for further research into long-term, context-sensitive video understanding.

Abstract

Video anomaly detection research is generally evaluated on short, isolated benchmark videos only a few minutes long. However, in real-world environments, security cameras observe the same scene for months or years at a time, and the notion of anomalous behavior critically depends on context, such as the time of day, day of week, or schedule of events. Here, we propose a context-aware video anomaly detection algorithm, Trinity, specifically targeted to these scenarios. Trinity is especially well-suited to crowded scenes in which individuals cannot be easily tracked, and anomalies are due to speed, direction, or absence of group motion. Trinity is a contrastive learning framework that aims to learn alignments between context, appearance, and motion, and uses alignment quality to classify videos as normal or anomalous. We evaluate our algorithm on both conventional benchmarks and a public webcam-based dataset we collected that spans more than three months of activity.
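
As a rough illustration of how alignment quality could serve as an anomaly signal, the sketch below scores a clip by the cosine misalignment between pairs of branch embeddings. It is a minimal sketch, not the authors' implementation: the function names (`cosine_similarity`, `misalignment_score`), the embedding dimension, the averaging over branch pairs, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def misalignment_score(context: np.ndarray,
                       appearance: np.ndarray,
                       motion: np.ndarray) -> float:
    """Hypothetical anomaly score: mean (1 - cosine similarity) over the
    three branch pairs. Higher means the branches agree less."""
    pairs = [(context, appearance), (context, motion), (appearance, motion)]
    return float(np.mean([1.0 - cosine_similarity(a, b) for a, b in pairs]))


# Toy embeddings (made up): in the "normal" clip all branches point roughly
# the same way; in the "anomalous" clip the context embedding is flipped, so
# it disagrees with appearance and motion.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
normal_score = misalignment_score(base,
                                  base + 0.05 * rng.normal(size=16),
                                  base + 0.05 * rng.normal(size=16))
anomalous_score = misalignment_score(-base, base, base)

print(f"normal clip score:    {normal_score:.3f}")
print(f"anomalous clip score: {anomalous_score:.3f}")
```

In a trained system the embeddings would come from learned encoders and the decision threshold would be tuned on held-out data; here the vectors are random stand-ins chosen only to show that a context mismatch raises the score.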

Paper Structure

This paper contains 23 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Snapshots at the same time of day at a baseball stadium. Left: A typical scenario during a non-game day; Middle: A typical scenario during a game day; Right: Unexpected group presence.
  • Figure 2: Pipeline for the proposed Trinity algorithm, which takes video frames, optical flow, and contextual information as inputs. The Trinity branches extract global and local representations from each input stream. Global and local alignments are used to learn joint embeddings at different scales, which are later used to detect anomalies by evaluating misalignments between branches.
  • Figure 3: Example of pseudo context anomaly evaluation. The algorithm is expected to identify whether the given context matches the input video. TN, TD, FA, and FN denote true negative, true detection, false alarm, and false negative, respectively.
  • Figure 4: Real context anomalies in the WF dataset. These two sample videos contain unexpected crowds at the front of the stadium (presence anomalies). The blue line is the prediction result and the orange line is the ground truth.
  • Figure 5: Selected results of pseudo anomalies in the WF dataset. Top: Selected pseudo anomaly detection results with exemplar frames. The orange line indicates the ground truth and the blue line is the corresponding prediction. The caption indicates the change from the true context (left of the arrow) to the pseudo context (right of the arrow). The parentheses indicate the type of context anomaly.
  • ...and 4 more figures
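
The four outcomes named in the pseudo context anomaly caption (TN, TD, FA, FN) can be tallied from per-clip labels as in the sketch below. The function name `tally_outcomes` and the toy label lists are assumptions for illustration; the source does not specify how the authors compute these counts.

```python
from collections import Counter


def tally_outcomes(ground_truth, predictions):
    """Count TN / TD / FA / FN given boolean per-clip labels.

    True means "the supplied context does NOT match the video" (an anomaly).
    """
    counts = Counter({"TN": 0, "TD": 0, "FA": 0, "FN": 0})
    for is_anomaly, flagged in zip(ground_truth, predictions):
        if is_anomaly and flagged:
            counts["TD"] += 1   # true detection: anomaly correctly flagged
        elif is_anomaly and not flagged:
            counts["FN"] += 1   # false negative: anomaly missed
        elif not is_anomaly and flagged:
            counts["FA"] += 1   # false alarm: normal clip flagged
        else:
            counts["TN"] += 1   # true negative: normal clip passed
    return dict(counts)


# Toy labels (made up for illustration).
truth = [False, True, True, False, True]
preds = [False, True, False, True, True]
outcomes = tally_outcomes(truth, preds)
print(outcomes)
```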