TRACES: Temporal Recall with Contextual Embeddings for Real-Time Video Anomaly Detection

Yousuf Ahmed Siddiqui, Sufiyaan Usmani, Umer Tariq, Jawwad Ahmed Shamsi, Muhammad Burhan Khan

TL;DR

TRACES addresses the challenge of context-aware zero-shot video anomaly detection by integrating motion and appearance through temporal cross-attention and a large Traces Bank of contextual embeddings. The method keeps its large encoders frozen, adds lightweight adapters and cross-modal fusion, and uses retrieval over text-derived context traces to score anomalies without labeled anomaly data. It achieves state-of-the-art zero-shot performance on UCF-Crime and XD-Violence with real-time inference and interpretable cross-attention explanations, demonstrating strong generalization to unseen events. The work advances practical surveillance deployment by combining context recall with open-set anomaly reasoning and providing robust, low-latency detection in real-world settings.

Abstract

Video anomalies often depend on the available contextual information and on temporal evolution. An action that is non-anomalous in one context can be anomalous in another. Most anomaly detectors, however, do not account for this kind of context, which seriously limits their ability to generalize to new, real-life situations. Our work addresses the context-aware zero-shot anomaly detection challenge, in which systems must learn adaptively to detect new events by correlating temporal and appearance features with textual memory traces in real time. Our approach defines a memory-augmented pipeline that correlates temporal signals with visual embeddings using cross-attention and performs real-time zero-shot anomaly classification by contextual similarity scoring. We achieve 90.4% AUC on UCF-Crime and 83.67% AP on XD-Violence, a new state of the art among zero-shot models. Our model achieves real-time inference with high precision and explainability for deployment. We show that combining cross-attention temporal fusion with contextual memory yields high-fidelity anomaly detection, a step towards the applicability of zero-shot models in real-world surveillance and infrastructure monitoring.

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Proposed framework for anomaly detection.
  • Figure 2: The architecture integrates CLIP-ViT for visual-language representation with up/down adapter blocks and temporal self-attention to capture sequence dynamics, while cross-attention fusion aligns multi-modal features for contextual anomaly reasoning.
  • Figure 3: Simplified scheme of the proposed trace retrieval framework, showing how a query embedding is compared against context-specific trace vectors in the Traces Bank.
  • Figure 4: t-SNE visualization of clustered trace embeddings from the Traces Bank. Six distinct context clusters are observed, each exhibiting different distributions of anomalous (red) and non-anomalous (blue) vectors.
  • Figure 5: Qualitative visualization of cross-attention interpretability on the XD-Violence dataset. The top frame shows a non-anomalous instance, while the bottom frame shows an anomalous event. Grad-CAM-inspired cross-attention heatmaps emphasize the spatial and temporal regions most influential to the model’s zero-shot reasoning.
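
The retrieval step summarized in the Figure 3 caption, where a query embedding is compared against trace vectors, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine-similarity measure, the top-k selection, the softmax weighting, and the `temperature` value are all assumptions chosen for illustration, and the toy two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def anomaly_score(query, trace_bank, k=3, temperature=0.1):
    """Hypothetical scoring rule: share of softmax weight on anomalous traces.

    trace_bank is a list of (vector, is_anomalous) pairs. The k traces most
    similar to the query are kept, weighted by a softmax over their
    similarities, and the weight that falls on anomalous traces is returned
    as a score in [0, 1].
    """
    if not trace_bank:
        raise ValueError("trace_bank must not be empty")
    scored = [(cosine_similarity(query, vec), is_anomalous)
              for vec, is_anomalous in trace_bank]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    best = top_k[0][0]  # subtract the maximum for numerical stability
    weights = [math.exp((sim - best) / temperature) for sim, _ in top_k]
    anomalous_mass = sum(w for w, (_, is_anomalous) in zip(weights, top_k)
                         if is_anomalous)
    return anomalous_mass / sum(weights)


# Toy trace bank (assumed data): two anomalous and two non-anomalous traces.
toy_bank = [
    ([1.0, 0.0], True),
    ([0.9, 0.1], True),
    ([0.0, 1.0], False),
    ([0.1, 0.9], False),
]

if __name__ == "__main__":
    print(anomaly_score([1.0, 0.0], toy_bank, k=2))
    print(anomaly_score([0.5, 0.5], toy_bank, k=4))
```

In this sketch a query that lies close to anomalous traces receives a score near 1, and one that lies close to non-anomalous traces receives a score near 0. A real-time system would more plausibly use batched tensor operations or an approximate nearest-neighbor index than Python loops.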