Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
TL;DR
M3-Agent is a multimodal agent with long-term memory that continuously perceives visual and audio streams, stores its experiences in an entity-centric memory graph, and answers instructions through multi-turn reasoning over retrieved memories. The framework separates memorization (online memory formation) from control (memory-based reasoning); memory generation is trained with imitation learning and reasoning with reinforcement learning. To evaluate memory effectiveness, the paper presents M3-Bench, a long-video question answering benchmark built from robot-perspective and web-sourced videos. M3-Agent outperforms the strongest baseline by 6.7%, 7.7%, and 5.3% in accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. The work highlights the importance of semantic memory and iterative memory retrieval for cross-modal reasoning in long-horizon, real-world scenarios, and offers practical design insights for memory-centric agents.
Abstract
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.
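The entity-centric memory and multi-turn retrieval described above can be illustrated with a small sketch. This is not the released M3-Agent implementation: the names `MemoryEntry`, `EntityMemory`, `answer`, and `toy_policy` are hypothetical, the memory holds plain text rather than multimodal content, and keyword overlap stands in for whatever retrieval the actual system uses. It only shows the general shape of the idea, assuming that episodic and semantic entries are indexed by the entities they mention and that a policy alternates between searching memory and answering.

```python
import re
from dataclasses import dataclass


def _tokens(text):
    """Lowercased word set; a crude stand-in for a learned retriever."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass(frozen=True)
class MemoryEntry:
    kind: str        # "episodic" (concrete event) or "semantic" (general knowledge)
    text: str
    entities: tuple  # identifiers of the entities this entry refers to


class EntityMemory:
    """Hypothetical text-only memory with an entity index."""

    def __init__(self):
        self.entries = []
        self.by_entity = {}  # entity id -> positions in self.entries

    def add(self, kind, text, entities):
        self.by_entity_index = len(self.entries)
        self.entries.append(MemoryEntry(kind, text, tuple(entities)))
        for entity in entities:
            self.by_entity.setdefault(entity, []).append(self.by_entity_index)

    def about(self, entity):
        """All entries linked to one entity."""
        return [self.entries[i] for i in self.by_entity.get(entity, [])]

    def search(self, query, top_k=2):
        """Rank entries by keyword overlap with the query (assumption)."""
        words = _tokens(query)
        scored = [(len(words & _tokens(e.text)), e) for e in self.entries]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: -pair[0])  # stable: ties keep insertion order
        return [entry for _, entry in scored[:top_k]]


def answer(memory, question, policy, max_turns=3):
    """Multi-turn loop: the policy either issues a search or returns an answer."""
    context = []
    for _ in range(max_turns):
        action, payload = policy(question, context)
        if action == "answer":
            return payload
        context.extend(e for e in memory.search(payload) if e not in context)
    return None  # no answer within the turn budget


def toy_policy(question, context):
    """Hand-written stand-in for the learned control policy."""
    for entry in context:
        if entry.kind == "semantic" and "alice" in entry.entities:
            return "answer", entry.text
    return "search", question


if __name__ == "__main__":
    memory = EntityMemory()
    memory.add("episodic", "Alice poured coffee at 9am", ["alice"])
    memory.add("semantic", "Alice prefers coffee over tea", ["alice"])
    memory.add("episodic", "Bob opened the window", ["bob"])
    print(answer(memory, "What does Alice prefer to drink", toy_policy))
```

In this toy setting the first turn retrieves both entries about Alice, and the second turn answers from the semantic entry, which mirrors the abstract's point that accumulated semantic memory, not only raw episodes, supports answering the instruction.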
