MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, Philip S. Yu

Abstract

Recent advances in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline covering 14 state-of-the-art LLM base models and 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains, evaluating an agent's ability to simulate real user behaviors in both single-domain and cross-domain settings. Our analysis reveals that existing memory methods fall far short of user satisfaction across domains, and MemoryCD offers the first testbed for cross-domain lifelong personalization evaluation.
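
To make the scale of this evaluation grid concrete, the sketch below enumerates every (base model, memory method, task, domain) cell. It is a hedged illustration, not the authors' released pipeline: the model, method, and domain names are drawn from the figures below, while the task names and the score_cell stub are assumptions.

```python
# Hedged sketch of the MemoryCD evaluation grid (not the authors' pipeline).
# Only a few of the 14 models, 6 memory methods, 4 tasks, and 12 domains are
# listed; task names and `score_cell` are hypothetical placeholders.
from itertools import product

BASE_MODELS = ["gpt-5", "claude-4-sonnet", "gemini-2.5-pro"]
MEMORY_METHODS = ["no-memory", "memorybank", "full-context"]
TASKS = ["rating_prediction", "review_generation"]
DOMAINS = ["Books", "Electronics", "Home & Kitchen"]

def score_cell(model: str, memory: str, task: str, domain: str) -> float:
    """Hypothetical stub: run one task over one domain's users and return an
    aggregate metric (e.g. MAE for a rating-style task)."""
    return 0.0  # a real harness would prompt `model` with `memory` here

results = {
    cell: score_cell(*cell)
    for cell in product(BASE_MODELS, MEMORY_METHODS, TASKS, DOMAINS)
}
print(f"evaluated {len(results)} (model, memory, task, domain) cells")
```

A real harness would replace score_cell with model calls plus per-task metrics (e.g. MAE/RMSE for rating-style tasks, as in Figure 4).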

Figures (8)

  • Figure 1: Comparison of memory benchmarks: MemoryCD (ours) captures cross-domain real-user activities over long time horizons. LaMP (salemi2024lamp) focuses only on short-term single-domain user behaviors; LoCoMo (maharana2024evaluating) represents non-authentic LLM-simulated user behaviors.
  • Figure 2: The MemoryCD benchmark spans 12 real-world domains and evaluates 6 SOTA memory methods. Unlike other memory benchmarks that target one specific memory stage (mostly retrieval), we design 4 basic tasks with 2 settings to provide end-to-end user satisfaction evaluation grounded in lifelong real user behaviors (Table \ref{tab:task-summary}).
  • Figure 3: Ratios of overlapping users across domains. Each cell shows the percentage of users in one domain that also appear in another, revealing feasible cross-domain personalization settings (a minimal computation sketch follows this list).
  • Figure 4: Performance comparisons using different memory sources, evaluated on the Home & Kitchen domain. Each radar plot corresponds to a representative frontier long-context LLM (GPT-5, Claude-4 Sonnet, Gemini-2.5 Pro). Curves correspond to different memory sources: no memory (Source N/A), single-domain memories (Books, Electronics, Personal Care), and aggregated cross-domain memory (All 3 Sources). Scales are unified for fair comparison (MAE and RMSE scales are reversed).
  • Figure 5: Performance comparisons of 14 frontier long-context LLMs on the Books domain using MemoryBank.
  • ...and 3 more figures
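
The cross-domain overlap ratios in Figure 3 can be computed directly from per-domain user ID sets. The sketch below is a minimal illustration under that assumption: domain_users is a hypothetical mapping from domain name to the set of user IDs active in that domain, and the paper's actual extraction from the Amazon Review dataset is not reproduced here.

```python
# Hedged sketch of the user-overlap ratios behind Figure 3 (assumed inputs).
def overlap_ratios(domain_users: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Percentage of users in domain `a` that also appear in domain `b`."""
    return {
        (a, b): 100.0 * len(users_a & domain_users[b]) / len(users_a)
        for a, users_a in domain_users.items()   # row domain a
        for b in domain_users                    # column domain b
        if a != b and users_a                    # skip diagonal and empty domains
    }

# Toy example: half of the Books users also appear in Electronics.
demo = {"Books": {"u1", "u2"}, "Electronics": {"u2", "u3"}}
print(overlap_ratios(demo))
# -> {('Books', 'Electronics'): 50.0, ('Electronics', 'Books'): 50.0}
```

Note that the matrix is asymmetric: each ratio is normalized by the user count of the row domain, so overlap(a, b) and overlap(b, a) generally differ.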