Dynamic Evaluation Framework for Personalized and Trustworthy Agents: A Multi-Session Approach to Preference Adaptability

Chirag Shah, Hideo Joho, Kirandeep Kaur, Preetam Prabhu Srikar Dammu

TL;DR

This paper addresses the gap in evaluating AI-driven personalized agents that serve users whose preferences evolve over time. It proposes a dynamic evaluation framework that models simulated user personas, elicits preferences through structured reference interviews, and uses iterative LLM-driven simulations to assess recommendations across multiple sessions and tasks. The key contributions are a seven-component framework (SIM, Personalized Agent, Tasks, Datasets, Ranked Items, Dynamic Evaluation, Measurements), a travel-planning case study with on-task and cross-task challenges, and a comprehensive metric schema for dynamic personalization. By enabling longitudinal, reproducible benchmarking, the approach supports the development of more trustworthy, proactive personalized agents across domains such as e-commerce and entertainment.

Abstract

Recent advancements in generative AI have significantly increased interest in personalized agents. With increased personalization comes a greater need to trust the decision-making and action-taking capabilities of these agents. However, the evaluation methods for these agents remain outdated and inadequate, often failing to capture the dynamic and evolving nature of user interactions. In this conceptual article, we argue for a paradigm shift in evaluating personalized and adaptive agents. We propose a novel, comprehensive framework that models user personas with unique attributes and preferences. In this framework, agents interact with these simulated users through structured interviews to gather their preferences and offer customized recommendations. These recommendations are then assessed dynamically using simulations driven by Large Language Models (LLMs), enabling an adaptive and iterative evaluation process. Our flexible framework is designed to support a variety of agents and applications, ensuring a comprehensive and versatile evaluation of recommendation strategies that focus on proactive, personalized, and trustworthy aspects.
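
The loop described above (simulated persona, structured interview, ranked recommendations, per-session assessment) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (SimulatedUser, interview, rank_items, evaluate), the tag-weight preference model, and the toy travel items are assumptions made for illustration, and a deterministic rule-based persona stands in for the LLM-driven simulation.

```python
from dataclasses import dataclass


@dataclass
class SimulatedUser:
    """Hypothetical persona: one dict of tag -> weight per session."""
    name: str
    sessions: list

    def answer(self, session: int, tag: str) -> float:
        # Interview answer: how much the persona cares about `tag` right now.
        return self.sessions[session].get(tag, 0.0)


def interview(user, session, tags):
    """Structured elicitation: ask the persona about each tag in turn."""
    return {tag: user.answer(session, tag) for tag in tags}


def rank_items(items, prefs):
    """Rank items by the summed preference weight of their tags."""
    def score(item):
        return sum(prefs.get(tag, 0.0) for tag in item["tags"])
    return sorted(items, key=score, reverse=True)


def evaluate(user, items, tags, adaptive=True):
    """Per-session top-1 match against the persona's true favorite item."""
    results = []
    prefs = interview(user, 0, tags)  # initial elicitation
    for s in range(len(user.sessions)):
        if adaptive:
            prefs = interview(user, s, tags)  # re-elicit every session
        recommended = rank_items(items, prefs)[0]["id"]
        favorite = rank_items(items, user.sessions[s])[0]["id"]
        results.append(1.0 if recommended == favorite else 0.0)
    return results


# Toy travel-planning data (assumed, not from the paper).
ITEMS = [
    {"id": "beach_resort", "tags": ["relaxation", "luxury"]},
    {"id": "mountain_trek", "tags": ["adventure", "budget"]},
    {"id": "city_break", "tags": ["culture", "budget"]},
]
TAGS = ["relaxation", "luxury", "adventure", "budget", "culture"]

# Preferences drift from relaxation to adventure to culture.
user = SimulatedUser("persona_a", [
    {"relaxation": 1.0, "luxury": 0.5},
    {"adventure": 1.0, "budget": 0.5},
    {"culture": 1.0, "budget": 0.2},
])

adaptive_scores = evaluate(user, ITEMS, TAGS, adaptive=True)
static_scores = evaluate(user, ITEMS, TAGS, adaptive=False)
print("adaptive:", adaptive_scores)
print("static:  ", static_scores)
```

In this toy setting, the agent that re-interviews the persona each session keeps matching the drifting preferences, while the agent that relies only on the first interview falls behind after session one. A single-session evaluation would not distinguish the two, which is the kind of gap a multi-session evaluation is meant to expose.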

Paper Structure

This paper contains 13 sections, 1 figure, 2 algorithms.

Figures (1)

  • Figure 1: Overview of our evaluation framework showing the core components and their interactions.