GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory
Jeong Hun Yeo, Sangyun Chung, Sungjune Park, Dae Hoe Kim, Jinyoung Moon, Yong Man Ro
TL;DR
This paper addresses the challenge of long-video understanding under the token and temporal-dependency constraints of multimodal large language models (MLLMs). It introduces GCAgent, a global-context-aware agent framework in which a Memory Manager builds a schematic and narrative episodic memory before query processing, and a separate Reasoning Agent performs query-driven reasoning in a Perception-Action-Reflection cycle. Empirical results show substantial gains, including up to a 23.5% accuracy improvement on the Video-MME Long split and state-of-the-art performance among 7B-scale MLLMs on various long-video benchmarks, supporting the effectiveness of explicit global-context memory for cognitively-inspired video understanding. The work emphasizes combining memory-based global context with local evidence grounding, and notes multilingual degradation and computational overhead as areas for further improvement. Overall, the approach offers a scalable path toward human-like long-video reasoning by structuring and leveraging narrative memory rather than relying on token expansion or retrieval alone.
Abstract
Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively-inspired long-video understanding.
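To make the idea of an episodic memory with events and causal/temporal relations more concrete, the following is a minimal Python sketch of how such a memory and a Perception-Action-Reflection loop might be organized. It is not the authors' implementation. All names (Event, Relation, EpisodicMemory, rank_events, reasoning_loop) are hypothetical, the keyword-overlap ranking stands in for whatever retrieval the Memory Manager actually performs, and the `inspect` and `is_sufficient` callables are placeholders for MLLM calls. The mapping of the three stages onto the loop body is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Event:
    event_id: int
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    summary: str    # short textual description of the event


@dataclass
class Relation:
    source: int     # event_id of the earlier or causing event
    target: int     # event_id of the later or resulting event
    kind: str       # "temporal" or "causal"


@dataclass
class EpisodicMemory:
    """Hypothetical container: events plus typed relations between them."""
    events: List[Event] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def get(self, event_id: int) -> Event:
        return next(e for e in self.events if e.event_id == event_id)

    def narrative(self) -> str:
        # Global context: event summaries in chronological order.
        ordered = sorted(self.events, key=lambda e: e.start_s)
        return " -> ".join(e.summary for e in ordered)

    def causes_of(self, event_id: int) -> List[int]:
        return [r.source for r in self.relations
                if r.target == event_id and r.kind == "causal"]


def rank_events(memory: EpisodicMemory, query: str) -> List[int]:
    """Toy retrieval (assumption): rank events by word overlap with the query."""
    query_words = set(query.lower().split())

    def overlap(event: Event) -> int:
        return len(query_words & set(event.summary.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return [e.event_id for e in sorted(memory.events, key=lambda e: -overlap(e))]


def reasoning_loop(query: str,
                   memory: EpisodicMemory,
                   inspect: Callable[[Event], str],
                   is_sufficient: Callable[[str, List[str]], bool],
                   max_steps: int = 3) -> Tuple[List[int], List[str]]:
    queue = rank_events(memory, query)
    seen: List[int] = []
    evidence: List[str] = []
    for _ in range(max_steps):
        # Perception: choose the next event segment worth looking at.
        pending = [i for i in queue if i not in seen]
        if not pending:
            break
        target = pending[0]
        # Action: gather local evidence from that segment.
        evidence.append(inspect(memory.get(target)))
        seen.append(target)
        # Reflection: stop if the evidence suffices, else follow causal links.
        if is_sufficient(query, evidence):
            break
        queue = memory.causes_of(target) + queue
    return seen, evidence


memory = EpisodicMemory(
    events=[Event(0, 0.0, 5.0, "a cat jumps onto the table"),
            Event(1, 5.0, 8.0, "the glass falls and breaks"),
            Event(2, 8.0, 15.0, "the owner sweeps the floor")],
    relations=[Relation(0, 1, "causal"), Relation(1, 2, "temporal")],
)
seen, evidence = reasoning_loop(
    "why did the glass break", memory,
    inspect=lambda e: e.summary,               # stand-in for an MLLM clip query
    is_sufficient=lambda q, ev: len(ev) >= 2,  # stand-in for a reflection step
)
print(seen)  # [1, 0]
```

In this toy run the loop first inspects the event that best matches the query, then follows the causal relation back to its cause. This illustrates how a structured memory could steer local evidence gathering without scanning every frame.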
