PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, Radha Poovendran, Gregory Wornell, Lyle Ungar, Dan Roth, Sihao Chen, Camillo Jose Taylor

TL;DR

The paper addresses the challenge of inferring implicit user personas in long, noisy interactions and delivering personalized, context-aware responses. It introduces PersonaMem-v2, a large-scale dataset with 1,000 implicit personas, 20k+ preferences, and 128k-token contexts, plus robust multi-session histories and privacy-aware design, enabling reinforcement learning and agentic memory experiments. Through GRPO-based reinforcement fine-tuning, a 4B reasoning model surpasses GPT-5 on implicit personalization, and an agentic memory framework compresses histories into a 2k-token memory to achieve state-of-the-art performance with 16x efficiency gains, while maintaining memory transparency. Together, these contributions point to a scalable path toward real-world personalized intelligence with interpretable memory and stronger alignment to individual user needs.

Abstract

Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time. In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperform GPT-5, reaching 53% accuracy in implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.

Paper Structure

This paper contains 25 sections, 1 equation, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of PersonaMem-v2 benchmark. On the left, we mimic how people talk to chatbots across many topics over time, forming long and varied interaction histories. In the middle, we zoom in on a few conversation snippets. Even simple task scenarios like refining an email or translating a message naturally reveal small details about someone’s life in many other scenarios, like what they study, what foods they like, whether they have pets, or what they’re planning for tomorrow. Taken together, these subtle signals help models build thorough yet succinct user profiles for personalization.
  • Figure 2: What’s inside the PersonaMem-v2 dataset. It spans broad and diverse distributions of user personas, preferences, and task scenarios, ranging from everyday interests like food, travel, and hobbies to demographic attributes, personality vibes, and professional backgrounds. This wide coverage is designed to better support training and evaluating personalized AI systems that reflect real-world users and use cases.
  • Figure 3: Schematic of our RL-based training strategies. The top figure illustrates long-context reasoning, where the model reasons over the full conversation history. The bottom one depicts agentic memory, where the model divides the full history into chunks and iteratively updates a memory of capped size. The model will receive a high reward if the memory turns out to be necessary and helpful in answering the final user query.
  • Figure 4: Performance of OpenAI models on the PersonaMem-v2 benchmark, comparing accuracy across 32k and 128k input contexts in MCQ and Open-Ended tasks. Despite recent advancements, we observe that frontier models still struggle with implicit personalization. The dashed line represents the random guess performance in MCQ tasks.
  • Figure 5: Breakdown of model accuracy by preference types and scenarios, aggregated across all evaluated models. The results indicate systematic variance in personalization capabilities, showing a reliance on population priors, with performance dropping on anti-stereotypical and dynamic preferences and on distinguishing preference ownership.
  • ...and 2 more figures
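
The agentic memory loop described in the Figure 3 caption (split the history into chunks, then iteratively update a memory of capped size) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokens stand in for model tokens, `naive_update` is a hypothetical placeholder for the model's memory rewrite, truncating to the most recent tokens is an assumed capping policy, and the reward computation is omitted entirely.

```python
from typing import Callable, List

Memory = List[str]


def chunk_history(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def truncate_memory(memory: Memory, cap: int) -> Memory:
    """Enforce the size cap by keeping the most recent tokens (assumed policy)."""
    return memory[-cap:] if cap > 0 else []


def naive_update(memory: Memory, chunk: List[str]) -> Memory:
    """Hypothetical stand-in for the model: append tokens not already in memory."""
    seen = set(memory)
    updated = list(memory)
    for token in chunk:
        if token not in seen:
            seen.add(token)
            updated.append(token)
    return updated


def build_memory(
    history: str,
    chunk_size: int,
    cap: int,
    update: Callable[[Memory, List[str]], Memory] = naive_update,
) -> Memory:
    """Iterate over chunks, updating and re-capping the memory after each one."""
    memory: Memory = []
    for chunk in chunk_history(history.split(), chunk_size):
        memory = truncate_memory(update(memory, chunk), cap)
    return memory


if __name__ == "__main__":
    history = "I have a cat . my cat likes tuna . I study physics"
    print(build_memory(history, chunk_size=4, cap=6))
```

In a real system the `update` callable would be a call to the trained model, which rewrites the human-readable memory given the previous memory and the next chunk; the cap here plays the role of the 2k-token memory budget mentioned in the abstract.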