EgoMem: Lifelong Memory Agent for Full-duplex Omnimodal Models

Yiqun Yao, Naitong Yu, Xiang Li, Xin Jiang, Xuezhi Fang, Wenjia Ma, Xuying Meng, Jing Li, Aixin Sun, Yequan Wang

TL;DR

EgoMem introduces lifelong memory for real-time, full-duplex omnimodal models. It organizes the system into three asynchronous processes (retrieval, omnimodal dialog, and memory management) and supports two memory levels: profile-only (Level-1) and content-driven social-graph memory (Level-2). Its retrieval and memory management modules exceed 95% accuracy on the test set, and when integrated with a fine-tuned RoboEgo chatbot the system achieves fact-consistency scores above 87% in personalized dialogs, indicating that lifelong personalization is feasible for embodied agents. The work also provides data pipelines, training masks, and evaluation benchmarks, establishing a strong baseline for future research on lifelong embodied memory and omnimodal interaction.

Abstract

We introduce EgoMem, the first lifelong memory agent tailored for full-duplex models that process real-time omnimodal streams. EgoMem enables real-time models to recognize multiple users directly from raw audiovisual streams, to provide personalized responses, and to maintain long-term knowledge of users' facts, preferences, and social relationships extracted from audiovisual history. EgoMem operates with three asynchronous processes: (i) a retrieval process that dynamically identifies the user via face and voice and gathers relevant context from a long-term memory; (ii) an omnimodal dialog process that generates personalized audio responses based on the retrieved context; and (iii) a memory management process that automatically detects dialog boundaries from omnimodal streams and extracts the information needed to update the long-term memory. Unlike existing memory agents for LLMs, EgoMem relies entirely on raw audiovisual streams, making it especially suitable for lifelong, real-time, and embodied scenarios. Experimental results demonstrate that EgoMem's retrieval and memory management modules achieve over 95% accuracy on the test set. When integrated with a fine-tuned RoboEgo omnimodal chatbot, the system achieves fact-consistency scores above 87% in real-time personalized dialogs, establishing a strong baseline for future research.
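The three asynchronous processes described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: it assumes the speaker has already been identified (the paper does this from face and voice), it uses text in place of audiovisual streams, and every name in it (`run_session`, the chunk fields `user`, `text`, `facts`, `boundary`) is hypothetical and chosen only for illustration.

```python
import asyncio

# Toy sketch only. Chunk fields ("user", "text", "facts", "boundary") and all
# function names are hypothetical stand-ins for the raw audiovisual pipeline.


async def retrieval(stream, memory, ctx_q, log_q):
    """Process (i): look up the identified user's profile in long-term memory."""
    for chunk in stream:
        profile = dict(memory.get(chunk["user"], {}))  # snapshot of the profile
        await ctx_q.put((chunk, profile))  # context for the dialog process
        await log_q.put(chunk)             # raw history for memory management
    await ctx_q.put(None)  # sentinel: stream ended
    await log_q.put(None)


async def dialog(ctx_q, replies):
    """Process (ii): produce a personalized reply from the retrieved context."""
    while (item := await ctx_q.get()) is not None:
        chunk, profile = item
        name = profile.get("name", "there")
        replies.append(f"Hi {name}, you said: {chunk['text']}")


async def memory_manager(log_q, memory):
    """Process (iii): at a dialog boundary, write extracted facts to memory."""
    session = []
    while (chunk := await log_q.get()) is not None:
        session.append(chunk)
        if chunk.get("boundary"):  # assumed to be detected upstream
            for c in session:
                memory.setdefault(c["user"], {}).update(c.get("facts", {}))
            session.clear()


async def run_session(stream, memory):
    """Run the three processes concurrently over one input stream."""
    ctx_q, log_q, replies = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(
        retrieval(stream, memory, ctx_q, log_q),
        dialog(ctx_q, replies),
        memory_manager(log_q, memory),
    )
    return replies
```

Calling `asyncio.run(run_session(stream, memory))` twice with the same `memory` dictionary shows the intended effect: facts stored at the end of the first dialog become available to the retrieval step of the second. The queues keep the three processes decoupled, so the dialog loop never waits on a memory update.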

Paper Structure

This paper contains 29 sections, 8 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Textual memory agents vs. full-duplex omnimodal memory agents (ours).
  • Figure 2: System illustration for EgoMem Level-1 (Profile-only).
  • Figure 3: System illustration for EgoMem Level-2 (Content-driven). We focus on the differences in the retrieval process and omit the details of other processes such as memory management.
  • Figure 4: Token stream structure and supervision mask for EgoMem training data.