EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use

Siwei Wen, Zhangcheng Wang, Xingjian Zhang, Lei Huang, Wenjun Wu

TL;DR

This work introduces EventMemAgent, an active online video agent framework based on a hierarchical memory module that integrates a multi-granular perception toolkit for active, iterative evidence capture and employs Agentic Reinforcement Learning (Agentic RL) to end-to-end internalize reasoning and tool-use strategies into the agent's intrinsic capabilities.

Abstract

Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often faces a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and uses event-granular reservoir sampling to dynamically process streaming video frames within a fixed-length buffer; long-term memory archives past observations in a structured form, event by event. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to internalize reasoning and tool-use strategies end to end as intrinsic capabilities of the agent. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: https://github.com/lingcco/EventMemAgent.
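
The dual-layer memory described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `HierarchicalMemory`, the cosine-similarity boundary test, the `threshold` value, and the per-event `capacity` are all assumptions made for illustration, since the abstract does not specify how boundaries are detected or how the buffer is sized. The sketch keeps a uniform reservoir sample (Algorithm R) of the frames in the current event inside a fixed-length buffer, and archives the event to long-term memory when a boundary is detected.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HierarchicalMemory:
    """Hypothetical sketch: short-term reservoir per event, long-term event archive."""

    def __init__(self, capacity=4, threshold=0.5, seed=0):
        self.capacity = capacity      # assumed fixed buffer length (frames per event)
        self.threshold = threshold    # assumed similarity cutoff for an event boundary
        self.rng = random.Random(seed)
        self.long_term = []           # archived events, one dict per event
        self._frames = []             # short-term buffer of (timestamp, feature) pairs
        self._seen = 0                # frames observed in the current event
        self._start = None
        self._last_t = None
        self._prev = None

    def observe(self, t, feature):
        # Assumed boundary rule: a sharp drop in similarity to the previous frame.
        if self._prev is not None and cosine(self._prev, feature) < self.threshold:
            self.flush()
        if self._start is None:
            self._start = t
        # Algorithm R: every frame of the event is kept with equal probability.
        self._seen += 1
        if len(self._frames) < self.capacity:
            self._frames.append((t, feature))
        else:
            j = self.rng.randrange(self._seen)
            if j < self.capacity:
                self._frames[j] = (t, feature)
        self._prev = feature
        self._last_t = t

    def flush(self):
        """Archive the current event into long-term memory and reset the buffer."""
        if self._seen == 0:
            return
        self.long_term.append({
            "start": self._start,
            "end": self._last_t,
            "frames": sorted(self._frames, key=lambda p: p[0]),
        })
        self._frames, self._seen, self._start = [], 0, None


if __name__ == "__main__":
    mem = HierarchicalMemory(capacity=4, threshold=0.5, seed=0)
    for t in range(10):
        mem.observe(t, [1.0, 0.0])    # first synthetic "event"
    for t in range(10, 20):
        mem.observe(t, [0.0, 1.0])    # abrupt feature change marks a second event
    mem.flush()
    for event in mem.long_term:
        print(event["start"], event["end"], len(event["frames"]))
```

With the synthetic stream in the usage section, the feature change at timestamp 10 triggers one boundary, so two events are archived, each holding at most `capacity` sampled frames. A real system would presumably derive features from a visual encoder and store richer event records (for example, captions), which this sketch leaves out.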

Paper Structure

This paper contains 33 sections, 5 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Overview of EventMemAgent. It consists of three core components: a hierarchical memory module that archives video streams into structured event-centric representations, a multi-granular perception toolkit for active, iterative evidence capture, and an agentic reinforcement learning framework that optimizes tool-use strategies. The agent dynamically retrieves memory and utilizes perception tools to answer questions at specific timestamps.
  • Figure 2: Comparison of memory management strategies. Fixed-length memory (top) frequently suffers from semantic fragmentation and information redundancy because its rigid temporal boundaries bisect continuous actions. In contrast, our event-centric hierarchical memory (bottom) preserves semantic continuity and information compactness.
  • Figure 3: Qualitative comparison of reasoning trajectories on OVO-Bench. (Left) The untrained agent fails to use tools flexibly. (Right) EventMemAgent successfully internalizes complex reasoning and tool-use strategies, allowing it to adaptively retrieve information from long-term memory and use tools for more precise observations to provide accurate answers.
  • Figure 4: Analysis of tool usage patterns on OVO-Bench before and after training. (a) Distribution of tool types invoked across the three task categories. (b) Distribution of the number of tool calls per sample.
  • Figure 5: Training Statistics of Agentic RL. The evolution of Average Reward (left), Number of Turns (middle), and Response Length (right) throughout the training process. The steady increase in both reward and response length demonstrates the model's improving capability in complex reasoning and tool usage.