Beyond Static Evaluation: Rethinking the Assessment of Personalized Agent Adaptability in Information Retrieval

Kirandeep Kaur, Preetam Prabhu Srikar Dammu, Hideo Joho, Chirag Shah

TL;DR

The paper argues that current personalized IR evaluation relies on static benchmarks and fixed metrics, which inadequately reflect users' evolving goals. It proposes a dynamic evaluation framework grounded in simulated personas, structured elicitation (reference interviews), and longitudinal interaction loops, demonstrated via a case study with PersonalWAB in online shopping. The contributions are a three-pillar framework, a concrete multi-session protocol, and a large-scale dynamic evaluation across 120 simulated users that reveals how agents adapt and where misalignment persists. The work offers a foundation for continuous, user-centric personalization assessment and motivates future research on memory, cross-domain personalization, and proactive adaptation.

Abstract

Personalized AI agents are becoming central to modern information retrieval, yet most evaluation methodologies remain static, relying on fixed benchmarks and one-off metrics that fail to reflect how users' needs evolve over time. These limitations hinder our ability to assess whether agents can meaningfully adapt to individuals across dynamic, longitudinal interactions. In this perspective paper, we propose a conceptual lens for rethinking evaluation in adaptive personalization, shifting the focus from static performance snapshots to interaction-aware, evolving assessments. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks. While recent works have embraced LLM-driven user simulation, we situate this practice within a broader paradigm for evaluating agents over time. To illustrate our ideas, we conduct a case study in e-commerce search using the PersonalWAB dataset. Beyond presenting a framework, our work lays a conceptual foundation for understanding and evaluating personalization as a continuous, user-centric endeavor.
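
To make the three components concrete, the following is a minimal Python sketch of a multi-session evaluation loop. It is an illustration under stated assumptions, not the authors' protocol: the names `Persona`, `Agent`, `alignment`, and `run_sessions` are hypothetical, preferences are reduced to numeric weights in [0, 1], the reference interview is reduced to a fixed number of direct questions per session, and the alignment score (one minus mean absolute error) is a stand-in for whatever metric an actual study would use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Persona:
    """Hypothetical simulated user whose preference weights drift over time."""
    weights: dict[str, float]
    drift: dict[str, float] = field(default_factory=dict)

    def advance(self) -> None:
        # Apply the per-session drift, clamping each weight to [0, 1].
        for attr, delta in self.drift.items():
            shifted = self.weights.get(attr, 0.0) + delta
            self.weights[attr] = min(1.0, max(0.0, shifted))


@dataclass
class Agent:
    """Toy agent that keeps a running estimate of the user's preferences."""
    belief: dict[str, float] = field(default_factory=dict)
    asked_at: dict[str, int] = field(default_factory=dict)
    session: int = 0

    def elicit(self, persona: Persona, questions: int) -> None:
        # Stand-in for a reference interview: ask about the attributes that
        # were asked about least recently (never-asked attributes first).
        order = sorted(persona.weights, key=lambda a: self.asked_at.get(a, -1))
        for attr in order[:questions]:
            self.belief[attr] = persona.weights[attr]
            self.asked_at[attr] = self.session
        self.session += 1


def alignment(agent: Agent, persona: Persona) -> float:
    # Assumed metric: 1 - mean absolute error between belief and true weights.
    # Attributes the agent has never asked about default to a neutral 0.5.
    error = sum(abs(agent.belief.get(a, 0.5) - w) for a, w in persona.weights.items())
    return 1.0 - error / len(persona.weights)


def run_sessions(persona: Persona, agent: Agent,
                 n_sessions: int, questions_per_session: int) -> list[float]:
    """Run elicitation, scoring, and preference drift for each session."""
    scores = []
    for _ in range(n_sessions):
        agent.elicit(persona, questions_per_session)
        scores.append(alignment(agent, persona))
        persona.advance()  # the user's preferences evolve before the next session
    return scores
```

Calling `run_sessions` returns one score per session, so adaptation can be read off as a trajectory (for example, the last score minus the first) rather than as a single static number. Because the persona drifts between sessions, beliefs the agent does not refresh go stale, which is the kind of persistent misalignment a one-off benchmark would not surface.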

Paper Structure

This paper contains 14 sections, 7 figures, 1 table, 4 algorithms.

Figures (7)

  • Figure 1: Illustration of two distinct user personas used in our evaluation. Persona A is a graduate student with lifestyle- and wellness-oriented preferences; Persona B is a retired user focused on quality, value, and trusted brands.
  • Figure 2: Sample queries by two users in session 1.
  • Figure 3: Structured reference interviews for two distinct personas. The agent adapts its elicitation strategy to each user's tone, goals, and preferences.
  • Figure 4: Final product recommendations for two distinct personas. The agent integrates interview context to tailor suggestions aligned with user goals, budget, and preferences.
  • Figure 5: Session B recommendations for two distinct personas. The agent integrates current queries with prior interview context to offer goal-aligned, personalized suggestions.
  • ...and 2 more figures