Learning Personalized Agents from Human Feedback

Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao, Shaoliang Nie, Mingyang Zhang, Lijuan Liu, Jaime Fernández Fisac, Shuyan Zhou, Saghar Hosseini

TL;DR

PAHF tackles online learning of personalized agent behavior by coupling explicit per-user memory with dual feedback channels. Its three-step loop (pre-action clarification, memory-grounded action, and memory updates driven by post-action feedback) addresses both partial observability and non-stationary user preferences. The authors provide a formalization, theoretical guarantees on dynamic regret, and an evaluation protocol spanning embodied manipulation and online shopping, demonstrating faster personalization and faster adaptation to persona shifts than baselines. The work highlights the practical value of integrating memory with proactive and reactive feedback for robust continual personalization in interactive agents.

Abstract

Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time. We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) seeking pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from memory, and (3) integrating post-action feedback to update memory when preferences drift. To evaluate this capability, we develop a four-phase protocol and two benchmarks in embodied manipulation and online shopping. These benchmarks quantify an agent's ability to learn initial preferences from scratch and subsequently adapt to persona shifts. Our theoretical analysis and empirical results show that integrating explicit memory with dual feedback channels is critical: PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines, reducing initial personalization error and enabling rapid adaptation to preference shifts.
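
The three-step loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `UserMemory`, `SimulatedUser`, and `pahf_step` are hypothetical names, memory is a plain dictionary keyed by context, and the "user" is a toy object that always answers truthfully. The real system presumably uses a language-model agent and a retrieval-backed memory.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    """Hypothetical explicit per-user memory: context -> preferred item."""
    prefs: dict = field(default_factory=dict)

    def retrieve(self, context):
        return self.prefs.get(context)

    def write(self, context, item):
        self.prefs[context] = item


class SimulatedUser:
    """Toy stand-in for the human; holds the true (hidden) preferences."""

    def __init__(self, true_prefs):
        self.true_prefs = dict(true_prefs)

    def answer(self, context):
        # Pre-action channel: the user answers a clarifying question.
        return self.true_prefs[context]

    def feedback(self, context, action):
        # Post-action channel: None if satisfied, else the corrected item.
        truth = self.true_prefs[context]
        return None if action == truth else truth


def pahf_step(memory, user, context):
    """One round: returns (action, asked_pre_action, corrected_post_action)."""
    asked = False
    choice = memory.retrieve(context)
    if choice is None:
        # Step 1: nothing in memory, so ask before acting.
        choice = user.answer(context)
        memory.write(context, choice)
        asked = True
    # Step 2: act on the preference grounded in memory.
    action = choice
    # Step 3: fold any post-action correction back into memory.
    correction = user.feedback(context, action)
    if correction is not None:
        memory.write(context, correction)
    return action, asked, correction is not None


memory = UserMemory()
user = SimulatedUser({"drink": "coke"})
r1 = pahf_step(memory, user, "drink")   # new user: agent asks first
user.true_prefs["drink"] = "coffee"     # preference drifts silently
r2 = pahf_step(memory, user, "drink")   # stale memory, corrected afterwards
r3 = pahf_step(memory, user, "drink")   # memory now reflects the drift
print(r1, r2, r3)
```

In this toy run the first round uses the pre-action channel, the second round acts on stale memory and is repaired through the post-action channel, and the third round succeeds without any feedback. Neither channel alone covers both situations, which is the intuition behind the paper's dual-channel design.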

Paper Structure

This paper contains 26 sections, 8 theorems, 8 equations, 12 figures, 3 tables.

Key Result

Proposition 1

Suppose rounds are unambiguous under the agent's information state (i.e., $\gamma=0$) and preferences switch at most $K\!\ge\!1$ times. Any policy that never uses post-action feedback incurs $\mathbb{E}\!\left[\sum_{t=1}^T L_t\right]=\Omega(T)$. Conversely, any policy that performs an update upon th
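
The first half of Proposition 1 can be illustrated with a toy simulation. The sketch below is an assumption-laden illustration, not the paper's construction: it uses a two-option world, a single preference switch, a 0/1 loss, and a hypothetical helper `cumulative_loss`. Because rounds are unambiguous, the agent in this toy never has a reason to ask before acting, so a policy without post-action feedback has no way to notice the switch.

```python
def cumulative_loss(T, switch_times, use_post_feedback):
    """Total 0/1 loss over T rounds in a toy two-option world (gamma = 0).

    The true preference flips at each round listed in switch_times.
    The agent starts with a correct belief and only revises it when
    use_post_feedback is True and it has just made an error.
    """
    true_pref = 0
    belief = 0
    total = 0
    switches = set(switch_times)
    for t in range(1, T + 1):
        if t in switches:
            true_pref = 1 - true_pref      # preference drift
        loss = int(belief != true_pref)    # the agent acts on its belief
        total += loss
        if use_post_feedback and loss:
            belief = true_pref             # post-action correction
    return total


for T in (1000, 2000):
    no_fb = cumulative_loss(T, [101], use_post_feedback=False)
    with_fb = cumulative_loss(T, [101], use_post_feedback=True)
    print(T, no_fb, with_fb)
```

With one switch at round 101, the no-feedback policy is wrong on every later round, so its loss grows linearly with $T$ (900 of 1000 rounds, 1900 of 2000), while the updating policy errs only once. That single error per switch is a property of this toy setting and should not be read as the bound stated in the proposition.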

Figures (12)

  • Figure 1: Static vs. continual personalization. Top: Static personalization, where offline human–AI logs are processed once to populate user profiles, and at deployment the agent only reads from this static memory. Bottom: Our continual personalization framework keeps the user in the loop during online interaction: the agent consults and updates explicit memory in the pre-action stage, and the user observes the outcome in the environment after the agent acts and provides post-action feedback, which is written back to memory to correct errors and adapt as preferences change.
  • Figure 2: Continual agent personalization using pre- and post-action feedback channels. On Day 1 (Initial Preference), the agent uses a pre-action interaction to learn Kate’s favorite drink ("Coke") before acting. On Day 2 (Contextual Preference), the agent overgeneralizes this memory when Kate is sleepy; after serving Coke, it is corrected via post-action feedback ("I prefer tea when I am sleepy"). On Day 3 (Dynamic Preference), the agent acts consistently with its memory (serving tea), but Kate’s underlying preference has shifted to coffee. This error triggers another post-action update, illustrating how the framework supports continual personalization under preference drift.
  • Figure 3: Results on embodied manipulation. Top row: Phase 1 learning curves for success rate (left), feedback frequency (middle), and average cumulative personalization error (ACPE) (right). Bottom row: Phase 3 learning curves for success rate (left), feedback frequency (middle), and ACPE (right). Shaded regions denote standard error.
  • Figure 4: Results on online shopping. Top row: Phase 1 learning curves for success rate (left), feedback frequency (middle), and average cumulative personalization error (ACPE) (right). Bottom row: Phase 3 learning curves for success rate (left), feedback frequency (middle), and ACPE (right). Shaded regions denote standard error.
  • Figure 5: Embodied manipulation results (Agent: GPT-4o, Human Simulator: GPT-4o, Memory: FAISS). Top row: Phase 1 learning curves for success rate (left), feedback frequency (middle), and average cumulative personalization error (ACPE) (right). Bottom row: Phase 3 learning curves for the same metrics. Shaded regions denote standard error.
  • ...and 7 more figures

Theorems & Definitions (17)

  • Proposition 1: Post-action feedback is necessary under preference drift
  • Proof (sketch)
  • Remark 1: Context-dependent preferences
  • Proposition 2: Pre-action feedback is necessary under partial observability
  • Proof (sketch)
  • Theorem 1: Complementarity of pre- and post-action feedback
  • Proof (sketch)
  • Lemma 1: Error shrinkage with $m$-ary questions
  • Proof
  • Lemma 2: Indistinguishability without post updates
  • ...and 7 more