M2A: Multimodal Memory Agent with Dual-Layer Hybrid Memory for Long-Term Personalized Interactions

Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, Wentao Zhang

TL;DR

M2A tackles long-term, personalized multimodal interaction by converting static personalization into a co-evolving memory process. It introduces a dual-layer hybrid memory (RawMessageStore and SemanticMemoryStore) linked by evidence_ids, enabling progressive narrowing from high-level semantic observations to raw dialogue evidence. Two collaborating agents, ChatAgent and MemoryManager, execute an online Memory Update/Query loop within a ReAct-inspired workflow, supported by tri-path retrieval (dense, BM25, cross-modal) and Reciprocal Rank Fusion. The authors also present a data synthesis pipeline to inject concept-grounded multimodal sessions into long conversations and demonstrate substantial gains over baselines on visually grounded, long-context questions. Together, these contributions advance scalable, personalized, long-horizon multimodal interactions with verifiable memory updates and retrieval efficiency.
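
The fusion step mentioned above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. It is not the authors' implementation: the function name, the example message ids, and the smoothing constant `k = 60` (a common default for RRF) are assumptions, and the three input lists merely stand in for the dense, BM25, and cross-modal retrievers.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of ids into one list.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); ids are returned by descending total score.
    k = 60 is an assumed default, not a value taken from the paper.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Break score ties by id so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical outputs of the three retrieval paths (ids are made up).
dense_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m3", "m9"]
cross_modal_hits = ["m7", "m1", "m4"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits, cross_modal_hits])
print(fused)
```

Because RRF uses only rank positions, it can combine retrievers whose raw scores are not comparable, such as BM25 scores and embedding similarities. Here `m1` rises to the top because all three paths return it.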

Abstract

This work addresses the challenge of personalized question answering in long-term human-machine interactions: when conversational history spans weeks or months and exceeds the context window, existing personalization mechanisms struggle to continuously absorb and leverage users' incremental concepts, aliases, and preferences. Current personalized multimodal models are predominantly static: concepts are fixed at initialization and cannot evolve during interactions. We propose M2A, an agentic dual-layer hybrid memory system that maintains personalized multimodal information through online updates. The system employs two collaborative agents: ChatAgent manages user interactions and autonomously decides when to query or update memory, while MemoryManager breaks down memory requests from ChatAgent into detailed operations on the dual-layer memory bank, which couples a RawMessageStore (immutable conversation log) with a SemanticMemoryStore (high-level observations), providing memories at different granularities. In addition, we develop a reusable data synthesis pipeline that injects concept-grounded sessions from Yo'LLaVA and MC-LLaVA into LoCoMo long conversations while preserving temporal coherence. Experiments show that M2A significantly outperforms baselines, demonstrating that transforming personalization from one-shot configuration to a co-evolving memory mechanism provides a viable path for high-quality individualized responses in long-term multimodal interactions. The code is available at https://github.com/Little-Fridge/M2A.
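
The dual-layer memory bank described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the code from the M2A repository: the class names `RawMessageStore` and `SemanticMemoryStore` and the `evidence_ids` link come from the paper, while every field, method, helper (`trace_evidence`), and example message is hypothetical. The keyword match stands in for the paper's actual retrieval.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RawMessage:
    """One immutable entry of the conversation log (fields are assumed)."""
    msg_id: str
    speaker: str
    text: str
    image_path: Optional[str] = None


class RawMessageStore:
    """Append-only log: messages can be added and read, never edited."""

    def __init__(self) -> None:
        self._messages: Dict[str, RawMessage] = {}

    def append(self, message: RawMessage) -> None:
        if message.msg_id in self._messages:
            raise ValueError(f"message {message.msg_id} already logged")
        self._messages[message.msg_id] = message

    def get(self, msg_id: str) -> RawMessage:
        return self._messages[msg_id]


@dataclass
class Observation:
    """High-level memory; evidence_ids point back into the raw log."""
    obs_id: str
    text: str
    evidence_ids: List[str]


class SemanticMemoryStore:
    """Editable layer: observations may be added or overwritten."""

    def __init__(self) -> None:
        self._observations: Dict[str, Observation] = {}

    def upsert(self, observation: Observation) -> None:
        self._observations[observation.obs_id] = observation

    def search(self, keyword: str) -> List[Observation]:
        # Naive keyword match, a placeholder for real retrieval.
        needle = keyword.lower()
        return [o for o in self._observations.values() if needle in o.text.lower()]


def trace_evidence(observation: Observation, raw: RawMessageStore) -> List[RawMessage]:
    """Narrow from an observation down to the raw messages that support it."""
    return [raw.get(msg_id) for msg_id in observation.evidence_ids]


# Hypothetical usage with made-up messages.
raw_store = RawMessageStore()
raw_store.append(RawMessage("msg-12", "user", "This is my dog Biscuit.", "biscuit_01.jpg"))
raw_store.append(RawMessage("msg-40", "user", "Biscuit hates the rain."))

semantic_store = SemanticMemoryStore()
semantic_store.upsert(
    Observation("obs-1", "The user's dog Biscuit dislikes rain.", ["msg-12", "msg-40"])
)

hits = semantic_store.search("biscuit")
evidence = trace_evidence(hits[0], raw_store)
print([m.msg_id for m in evidence])
```

Keeping the raw log immutable while only the semantic layer is rewritten means an updated observation can always be checked against the original dialogue turns it cites.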

Paper Structure

This paper contains 49 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: M$^{2}$A enables incremental personalization with an editable multimodal memory. Unlike Yo'LLaVA and RAP-LLaVA, which keep initial concept tokens or text-only profiles without write-back, M$^{2}$A updates a unified memory bank during interaction and queries it at generation time, yielding recommendations aligned with evolving preferences across long, multi-session dialogs.
  • Figure 2: Overview of the M$^{2}$A framework. M$^{2}$A employs a multi-agent architecture consisting of a ChatAgent for user interaction and a MemoryManager for autonomous memory operations. The system leverages a Dual-Layer Hybrid Memory bank, linking high-level semantic observations in the Semantic Store to immutable conversational logs in the Raw Message Store via evidence IDs.
  • Figure 3: Overview of the proposed dataset construction pipeline. We first organize source images into semantic Concept Groups. Then, a unified One-Call Generation strategy produces concept-grounded dialogues and QA pairs. Finally, these generated sub-narratives are seamlessly interpolated into the original LoCoMo sessions to create the final hybrid dialog.
  • Figure 4: Detailed results on Visual-Centric questions.
  • Figure 5: Detailed results on Visual-Centric questions.