Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li

TL;DR

M3-Agent introduces a multimodal agent with long-term memory that continuously perceives visual and audio streams, stores experiences in an entity-centric memory graph, and performs multi-turn reasoning by retrieving memories. The framework separates memorization (online memory formation) from control (memory-based reasoning), and is trained with imitation learning for memory generation plus reinforcement learning for reasoning. To evaluate memory effectiveness, the paper presents M3-Bench, a long-video question answering (LVQA) benchmark with robot-perspective and web videos, on which M3-Agent consistently surpasses strong baselines across robot, web, and long-video tasks. The work highlights the importance of semantic memory and iterative memory retrieval for robust cross-modal reasoning in long-horizon, real-world scenarios, and offers practical design insights for scalable memory-centric agents.

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.
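
The two processes described in the abstract, memorization (writing episodic and semantic memories tied to entities) and control (iteratively retrieving memories until the instruction can be answered), can be illustrated with a toy sketch. This is not the authors' implementation: the names `MemoryItem`, `MemoryStore`, `control`, and `toy_policy` are hypothetical, word overlap stands in for whatever retrieval the real system uses, and the hand-written policy stands in for the MLLM.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    kind: str            # "episodic" or "semantic"
    text: str            # natural-language content of the memory
    entities: set[str]   # ids of the entities (e.g. people) this memory is about


@dataclass
class MemoryStore:
    """Hypothetical entity-tagged memory store (a flat stand-in for a memory graph)."""
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, kind: str, text: str, entities: set[str]) -> None:
        # Memorization step: append one memory and the entities it refers to.
        self.items.append(MemoryItem(kind, text, set(entities)))

    def search(self, query: str, top_k: int = 2) -> list[MemoryItem]:
        # Toy relevance score: number of shared lowercase words.
        query_words = set(query.lower().split())
        scored = []
        for index, item in enumerate(self.items):
            overlap = len(query_words & set(item.text.lower().split()))
            if overlap > 0:
                scored.append((overlap, index, item))
        # Highest overlap first; earlier memories win ties.
        scored.sort(key=lambda entry: (-entry[0], entry[1]))
        return [item for _, _, item in scored[:top_k]]


def control(instruction, memory, policy, max_turns=5):
    """Control step: alternate between reasoning (policy) and memory retrieval."""
    context = []
    for _ in range(max_turns):
        action, argument = policy(instruction, context)
        if action == "answer":
            return argument
        # Any other action is treated as a search with `argument` as the query.
        context.extend(item.text for item in memory.search(argument))
    return None  # no answer within the turn budget


def toy_policy(instruction, context):
    # Assumption: search once, then answer with the top retrieved memory.
    if not context:
        return ("search", instruction)
    return ("answer", context[0])
```

For example, after adding a few memories about a person, `control(question, store, toy_policy)` performs one retrieval round and returns the best-matching memory text. A real agent would replace `toy_policy` with a model that decides each turn whether to search again or answer.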

Paper Structure

This paper contains 46 sections, 4 equations, 5 figures, 10 tables, 2 algorithms.

Figures (5)

  • Figure 1: Architecture of M3-Agent, comprising a multimodal large language model (MLLM) and a multimodal long-term memory. The system consists of two parallel processes: memorization and control. During memorization, M3-Agent processes video and audio streams online to generate episodic and semantic memory. During control, it executes instructions by iteratively reasoning and retrieving from long-term memory. The long-term memory is structured as a multimodal graph.
  • Figure 2: Examples from M3-Bench. M3-Bench-robot features long videos from realistic robotic work scenarios, while M3-Bench-web expands the video diversity to support broader evaluation. The question-answering tasks are designed to assess a multimodal agent’s ability to construct consistent and reliable long-term memory, as well as to reason effectively over that memory.
  • Figure 3: Statistical overview of the M3-Bench benchmark. Each question may correspond to multiple question types.
  • Figure 4: Average scores (on the training set) and accuracy (on the dev set) during DAPO training. The curve in the left figure is smoothed with an exponential moving average (EMA), using the same formula as WandB and a smoothing weight of 0.9.
  • Figure :
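
The Figure 4 caption mentions EMA smoothing with a weight of 0.9 but does not spell out the formula. The sketch below shows one common debiased form of the exponential moving average. Whether it matches the exact variant used for the figure is an assumption, and the function name `ema_smooth` is hypothetical.

```python
def ema_smooth(values, weight=0.9):
    """Debiased exponential moving average over a sequence of numbers.

    weight: smoothing weight in [0, 1); higher values give a smoother curve.
    """
    smoothed = []
    last = 0.0
    for step, value in enumerate(values, start=1):
        # Standard EMA update, starting from an accumulator of zero.
        last = last * weight + (1.0 - weight) * value
        # Correct the bias toward zero caused by the zero initialization.
        debias = 1.0 - weight ** step
        smoothed.append(last / debias)
    return smoothed
```

With the debiasing term, the first smoothed point equals the first raw point, and a constant series stays constant rather than ramping up from zero.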