PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu

Abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

Paper Structure (41 sections, 7 equations, 20 figures, 12 tables)

Figures (20)

  • Figure 1: Comparison of context construction and evaluation. (Left) Existing benchmarks: Evaluate isolated preferences via sparse, "Needle-in-a-Haystack" retrieval. (Right) PERMA: Implements an event-driven paradigm where preferences are integrated over time and across sessions to assess the capabilities of memory systems.
  • Figure 2: The PERMA pipeline for dialogue construction and evaluation. Left: The dialogue construction pipeline leverages User Profiles and domain-specific Interaction Summaries to generate a structured timeline. Right: Evaluation of Task Events is conducted through two protocols: (1) One-shot MCQ probing, which measures selection accuracy across three evaluation dimensions to assess zero-shot preference recall; and (2) Interactive evaluation, involving multi-turn dialogues where a user simulator assesses task completion and preference satisfaction, while providing corrective feedback for suboptimal responses. Both evaluation protocols are executed across varying temporal depths within the full dialogue history.
  • Figure 3: Performance comparison of memory systems across evaluation checkpoints in the Clean setting (Single). (Left) MCQAcc. across three checkpoint types. (Right) MemoryScore across the checkpoint types. From Type 1 to Type 3, temporal depth and cross-domain interference of the dialogue history increase (see the definition of checkpoint types for their specification).
  • Figure 4: Comprehensive comparison of model and memory system performance across Clean and Noise single-domain scenarios: (Left) MCQAcc. of standalone LLMs, (Center) MCQAcc. of memory systems based on GPT-4o-mini, (Right) Turn=1 Success Rate of memory systems.
  • Figure 5: MCQAcc. of standalone LLMs at different evaluation checkpoints. Results are categorized by single-domain (Left) and multi-domain (Right) settings.
  • ...and 15 more figures

Theorems & Definitions (5)

  • definition 1
  • definition 2
  • definition 3
  • definition 4
  • definition 5