Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

Hyogun Lee, Haksub Kim, Ig-Jae Kim, Yonghun Choi

TL;DR

Flashback tackles the need for zero-shot, real-time video anomaly detection by offline memory construction using a frozen LLM and online retrieval via a cross-modal encoder. It introduces repulsive prompting and scaled anomaly penalization to reduce embedding bias and improve discrimination, producing per-segment anomaly scores and human-readable captions without online LLM calls. Across UCF-Crime and XD-Violence, it achieves state-of-the-art zero-shot performance and real-time throughput on consumer GPUs, outperforming baselines across AUC and AP metrics. The approach offers practical, explainable VAD suitable for large-scale surveillance while highlighting potential biases and areas for future work.

Abstract

Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints -- requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU. On two large datasets from real-world surveillance scenarios, UCF-Crime and XD-Violence, we achieve 87.3 AUC (+7.0 pp) and 75.1 AP (+13.1 pp), respectively, outperforming prior zero-shot VAD methods by large margins.
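The two stages described above can be illustrated with a minimal sketch. It assumes a tiny hand-written memory of caption embeddings and a simple scoring rule (the fraction of anomalous captions among the top-K matches); the names `MemoryEntry`, `MEMORY`, and `respond`, the 2-D toy embeddings, and the scoring rule itself are hypothetical illustrations, not the authors' implementation. In the paper's setting the embeddings would come from a frozen cross-modal encoder and the captions from an offline LLM.

```python
import math
from typing import NamedTuple


class MemoryEntry(NamedTuple):
    embedding: list[float]  # caption embedding (toy 2-D values, assumed)
    caption: str            # scene sentence generated offline
    is_anomalous: bool      # label assigned when the memory was built


# Offline "recall" stage: in this sketch the memory is simply written by hand.
MEMORY = [
    MemoryEntry([1.0, 0.0], "A person walks along the sidewalk", False),
    MemoryEntry([0.9, 0.1], "Cars wait at a traffic light", False),
    MemoryEntry([0.0, 1.0], "Two people fight in a parking lot", True),
    MemoryEntry([0.1, 0.9], "A person smashes a shop window", True),
]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def respond(segment_embedding: list[float], memory: list[MemoryEntry], k: int = 2):
    """Online "respond" stage: retrieve the top-k captions for one segment.

    Returns (anomaly_score, captions), where the score is the fraction of
    anomalous entries among the k nearest captions (an assumed rule).
    """
    ranked = sorted(
        memory,
        key=lambda entry: cosine(segment_embedding, entry.embedding),
        reverse=True,
    )
    top = ranked[:k]
    score = sum(entry.is_anomalous for entry in top) / len(top)
    return score, [entry.caption for entry in top]


# A segment whose (toy) embedding lies close to the anomalous captions.
score, captions = respond([0.2, 0.8], MEMORY, k=2)
print(score, captions)
```

Because the online stage only embeds the segment and ranks stored captions, no LLM call is needed at inference time, and the retrieved captions double as a textual rationale for the score.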

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Bridging speed and reasoning. (a) Real-time VAD keeps a light video encoder online but cannot work zero-shot or explain its decisions. (b) Explainable VAD adds a large VLM + LLM in the loop; reasoning is possible, yet speed drops and zero-shot ability is partial. (c) Flashback moves the LLM offline, builds a pseudo-scene memory once, and uses a frozen cross-modal encoder at test time, so it is simultaneously real-time, zero-shot, and explainable. (d) On XD-Violence [Wu2020not] (XD), this design lifts AP by 13 percentage points and boosts throughput 34$\times$ over the prior state-of-the-art.
  • Figure 2: Overview of Flashback. Flashback operates in two disjoint stages. Offline Recall: a frozen LLM [openai_gpt4o] generates a diverse set of normal and anomalous scene sentences using context and format prompts $\mathtt{P}_\text{C}, \mathtt{P}_\text{F}$, which are embedded by a frozen video-text encoder and stored in a million-entry Pseudo-Scene Memory $\mathcal{C}_\text{N}, \mathcal{C}_\text{A}$. Repulsive Prompting widens the separation between normal and anomalous embeddings, countering the encoder's bias. Online Respond: we embed each incoming segment $V_s$, retrieve its top-$K$ matches from the memory, and debias the resulting similarities with Scaled Anomaly Penalization. The resulting scores, together with the retrieved sentences, provide real-time anomaly alerts and concise textual rationales.
  • Figure 3: Qualitative examples. The plots show frame-wise anomaly curves. Red boxes on both the video strip and the plot mark ground-truth anomalous intervals. For selected frames we list the retrieved category-caption pairs $(\kappa,c)$ and their anomaly flags $y$. Black text denotes a correct description, gray text an incorrect one. (a) & (b) The top captions describe the event precisely. (c) Flashback flags "Pickpocketing" as abnormal, but XD-Violence [Wu2020not] treats it as normal. (d) LAVAD [zanella2024harnessing] misses short anomalies and often outputs malformed sentences, whereas Flashback detects the event and returns a concise caption. (e) Removing repulsive prompting (RP) causes frequent false alarms on a normal clip.
  • Figure 4: t-SNE embeddings of caption features. We subsample 5,000 normal-anomalous caption pairs and visualize them (a) before and (b) after applying repulsive prompting (RP). RP clearly separates the two groups.
  • Figure 5: AUC vs. scale factor $\bm\alpha$. A mild reduction ($\alpha\!\approx\!0.95$) yields favorable AUC, confirming that scaled anomaly penalization is effective without fine-tuning.
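
The effect of the scale factor in Figure 5 can be illustrated with a small sketch. The captions above only state that Scaled Anomaly Penalization debiases the retrieved similarities and that a mild reduction ($\alpha \approx 0.95$) works well; the exact formula is not given here. The sketch therefore assumes that similarities to anomalous captions are multiplied by $\alpha$ before the top-$K$ ranking, and that the score is the fraction of anomalous entries among the top $K$. The function name `penalized_score` and all similarity values are hypothetical.

```python
def penalized_score(normal_sims, anomalous_sims, alpha=0.95, k=3):
    """Fraction of anomalous entries in the top-k after scaling.

    Assumption: anomalous similarities are multiplied by alpha (< 1 shrinks
    them) before ranking; normal similarities are left unchanged.
    """
    scored = [(s, False) for s in normal_sims]
    scored += [(alpha * s, True) for s in anomalous_sims]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(is_anomalous for _, is_anomalous in top) / len(top)


# A normal clip where one anomalous caption scores slightly too high,
# mimicking an encoder bias toward anomalous text (toy numbers).
normal_sims = [0.30, 0.299, 0.298]
anomalous_sims = [0.31, 0.29, 0.20]

print(penalized_score(normal_sims, anomalous_sims, alpha=1.0))   # no penalty
print(penalized_score(normal_sims, anomalous_sims, alpha=0.95))  # mild penalty

# A clearly anomalous clip keeps its high score under the same penalty.
print(penalized_score([0.20, 0.19, 0.18], [0.40, 0.39, 0.38], alpha=0.95))
```

Under these toy numbers, the mild penalty removes the borderline false alarm while leaving the clearly anomalous clip unaffected, which is the qualitative behavior the figure attributes to a value of $\alpha$ slightly below 1.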