Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding

Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Ao Ma, Run Ling, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li

TL;DR

Video-EM is a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. It produces a minimal but sufficient episodic memory set that existing Video-LLMs can consume directly, without additional training or architectural changes.

Abstract

Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present Video-EM, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to orchestrate off-the-shelf tools: it first localizes query-relevant moments via multi-grained semantic matching, then groups and segments them into temporally coherent events, and finally encodes each event as a grounded episodic memory with explicit temporal indices and spatio-temporal cues (capturing when, where, what, and involved entities). To further suppress verbosity and noise from imperfect upstream signals, Video-EM integrates a reasoning-driven self-reflection loop that iteratively verifies evidence sufficiency and cross-event consistency, removes redundancy, and adaptively adjusts event granularity. The outcome is a compact yet reliable event timeline: a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.
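
To make the "groups and segments them into temporally coherent events" step more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Event` class, the `group_into_events` function, and the `max_gap` threshold (in seconds) are all hypothetical names chosen for illustration, and a simple temporal-gap rule stands in for whatever grouping criterion the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """A hypothetical container for one temporally coherent event."""
    start: float                      # timestamp of the first frame (seconds)
    end: float                        # timestamp of the last frame (seconds)
    timestamps: List[float] = field(default_factory=list)


def group_into_events(timestamps: List[float], max_gap: float = 2.0) -> List[Event]:
    """Merge query-relevant frame timestamps into events.

    Assumption: two retrieved frames belong to the same event when they are
    at most `max_gap` seconds apart. This is an illustrative rule only.
    """
    events: List[Event] = []
    for t in sorted(timestamps):
        if events and t - events[-1].end <= max_gap:
            # Close enough to the previous frame: extend the current event.
            events[-1].end = t
            events[-1].timestamps.append(t)
        else:
            # Too far from the previous frame: start a new event.
            events.append(Event(start=t, end=t, timestamps=[t]))
    return events


# Retrieved frames arrive unordered; the grouping sorts them first.
events = group_into_events([12.0, 3.0, 4.5, 13.0, 40.0, 5.0])
spans = [(e.start, e.end) for e in events]
```

With the sample input, the six isolated frames collapse into three events, each carrying explicit start and end indices. That is the kind of compact timeline the abstract describes, although the real system additionally attaches spatio-temporal cues and refines the events through self-reflection.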

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the limitations of existing training-free keyframe sampling methods. (a) Isolated frames break temporal continuity and weaken event narratives. (b) Redundant frames waste context and dilute key cues, hurting performance.
  • Figure 2: The pipeline of the Video-EM framework consists of three steps: Key Event Selection, Episodic Memory Construction, and CoT-based Video Reasoning.
  • Figure 3: Qualitative examples from HourVideo comparing our model (green) with Qwen2.5-VL (red). Frames are manually selected to highlight query-relevant events.
  • Figure 4: Qualitative examples of our Episodic Memory, composed of Dynamic Scene Narratives and Dynamic Scene Relationships.
  • Figure 5: Ablation study of the number of frames under different sampling strategies on the HourVideo dataset.
  • ...and 1 more figure