Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding

Yun Wang; Long Zhang; Jingren Liu; Jiaqi Yan; Zhanjie Zhang; Jiahao Zheng; Ao Ma; Run Ling; Xun Yang; Dapeng Wu; Xiangyu Chen; Xuelong Li

TL;DR

Video-EM is a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. It produces a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.

Abstract

Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering.

We present Video-EM, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to orchestrate off-the-shelf tools: it first localizes query-relevant moments via multi-grained semantic matching, then groups and segments them into temporally coherent events, and finally encodes each event as a grounded episodic memory with explicit temporal indices and spatio-temporal cues (capturing when, where, what, and involved entities).

To further suppress verbosity and noise from imperfect upstream signals, Video-EM integrates a reasoning-driven self-reflection loop that iteratively verifies evidence sufficiency and cross-event consistency, removes redundancy, and adaptively adjusts event granularity. The outcome is a compact yet reliable event timeline: a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.
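To make the "group into temporally coherent events" step concrete, the sketch below merges the timestamps of retrieved frames into events whenever neighbouring frames are close in time. It is only an illustration under stated assumptions, not the Video-EM implementation: the `Event` class, the `group_into_events` and `to_timeline` helpers, and the `max_gap` threshold are all hypothetical, and the abstract describes LLM-orchestrated grouping and refinement rather than a fixed gap rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """Hypothetical container for one temporally coherent event."""
    start: float          # timestamp of the first frame, in seconds
    end: float            # timestamp of the last frame, in seconds
    frames: List[float]   # timestamps of all frames assigned to this event


def group_into_events(timestamps: List[float], max_gap: float = 2.0) -> List[Event]:
    """Merge frame timestamps into events.

    A frame joins the current event if it is at most `max_gap` seconds after
    that event's last frame; otherwise it starts a new event.
    `max_gap` is an assumed illustrative threshold, not a value from the paper.
    """
    events: List[Event] = []
    for t in sorted(timestamps):
        if events and t - events[-1].end <= max_gap:
            events[-1].end = t
            events[-1].frames.append(t)
        else:
            events.append(Event(start=t, end=t, frames=[t]))
    return events


def to_timeline(events: List[Event]) -> List[str]:
    """Render events as text lines with explicit temporal indices."""
    return [
        f"[{e.start:.1f}s-{e.end:.1f}s] {len(e.frames)} frame(s)"
        for e in events
    ]


if __name__ == "__main__":
    # Assumed example: timestamps of query-relevant frames from a retriever.
    retrieved = [10.0, 11.0, 12.5, 40.0, 41.0, 90.0]
    for line in to_timeline(group_into_events(retrieved)):
        print(line)
```

With these example timestamps, the six isolated frames collapse into three events, which hints at how an event-level timeline can be more compact than a list of independently scored frames. In a fuller pipeline, each event would additionally carry a textual description and be revised by the self-reflection loop, which this sketch does not attempt to model.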
