PrefIx: Understand and Adapt to User Preference in Human-Agent Interaction

Jialin Li, Zhenhao Chen, Hanjun Luo, Hanan Salam

TL;DR

PrefIx addresses the challenge of evaluating human-agent interaction quality alongside task accuracy by introducing a configurable environment and the Interaction-as-a-Tool (IaaT) paradigm. It formalizes user experience through a taxonomy of 14 preference attributes across four dimensions and evaluates UX with a composite, multi-LLM judge across seven UX dimensions plus an alignment metric, reporting strong aggregate reliability (ICC > 0.79) and human correlation (rho = 0.52 to 0.78). The study shows that preference-aware adaptation improves user experience (average ≈7.6%) and alignment (≈18.5%) without sacrificing task performance, demonstrated across multiple LLMs and BFCL-based multi-turn tasks. These contributions establish a scalable, reproducible framework for human-centered evaluation of interactive agents, with practical impact for developing more user-aligned AI assistants.

Abstract

LLM-based agents can complete tasks correctly yet still frustrate users through poor interaction patterns, such as excessive confirmations, opaque reasoning, or misaligned pacing. Current benchmarks evaluate task accuracy but overlook how agents interact: whether they infer preferences from implicit cues, adapt dynamically, or maintain fine-grained interaction quality. We introduce PrefIx, a configurable environment that evaluates both what agents accomplish and how they interact. Central to PrefIx is the Interaction-as-a-Tool (IaaT) paradigm, which treats interaction behaviors as structured tool calls, unifying them with existing evaluation frameworks. We define 31 preference settings across 14 attributes and formalize user experience (UX) as a core metric alongside task accuracy. A composite LLM-as-a-Judge mechanism across seven UX dimensions achieves strong aggregate reliability (ICC > 0.79), high internal consistency (alpha = 0.943), and human correlation (rho = 0.52 to 0.78). Preference-aware agents show a 7.6% average UX improvement and an 18.5% gain in preference alignment. Our work is openly accessible.
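
To make the Interaction-as-a-Tool idea concrete, here is a minimal sketch of how interaction behaviors might be declared as structured tool calls and separated from ordinary task tools when scoring a trajectory. Everything in it is an assumption for illustration: the tool names (`ask_confirmation`, `explain_reasoning`, `book_flight`), the schema fields, and the helper functions are hypothetical and are not taken from the PrefIx code base.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Hypothetical interaction tools, written in the JSON-schema style that is
# commonly used for function calling. Names and fields are illustrative only.
INTERACTION_TOOLS: Dict[str, Dict[str, Any]] = {
    "ask_confirmation": {
        "description": "Ask the user to confirm before acting.",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
    "explain_reasoning": {
        "description": "Tell the user why the agent chose an action.",
        "parameters": {
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
    },
}


@dataclass
class ToolCall:
    """One structured call emitted by the agent during a turn."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def split_trajectory(calls: List[ToolCall]) -> Tuple[List[ToolCall], List[ToolCall]]:
    """Separate interaction calls from system (task) calls.

    The task calls can go to an existing tool-use checker unchanged, while the
    interaction calls are what a UX evaluation would inspect.
    """
    interaction = [c for c in calls if c.name in INTERACTION_TOOLS]
    system = [c for c in calls if c.name not in INTERACTION_TOOLS]
    return interaction, system


def missing_arguments(call: ToolCall) -> List[str]:
    """Return the required arguments that an interaction call left out."""
    required = INTERACTION_TOOLS[call.name]["parameters"]["required"]
    return [key for key in required if key not in call.arguments]
```

Under these assumptions, a trajectory that confirms once, books a flight, and then explains itself splits into two interaction calls and one task call, so task accuracy and interaction behavior can be checked through the same tool-call interface.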

Paper Structure (35 sections, 1 equation, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Impact of Preference Awareness in Interaction. By inferring user preferences from history, the agent reduces unnecessary turns (right), whereas a rigid agent (left) leads to user frustration despite correct task execution.
  • Figure 2: Overview of PrefIx. Tasks from BFCL are coarsened into flexible instructions (left), a preference-aware simulator interacts with the agent expressing preferences implicitly (center), and the UX Judge evaluates the resulting trajectory across seven dimensions (right).
  • Figure 3: Tool-use accuracy comparison between Baseline (No_P) and Adaptation (P) conditions across four models. Baseline performance is comparable to BFCL leaderboard scores, confirming that interaction tools do not interfere with system tool execution.
  • Figure 4: Performance gains in interaction preference alignment across four categories, measured as the delta between adaptation and baseline conditions. Results are aggregated by category and normalized by the number of preference settings per group.
  • Figure 5: Inter-judge correlation heatmap for UX scores. Each cell shows the Pearson correlation between two LLM judges. Moderate correlations (0.4–0.7) indicate a shared underlying construct while preserving diverse perspectives.
  • ...and 4 more figures
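
The inter-judge heatmap described in Figure 5 can be sketched as a pairwise Pearson correlation over per-trajectory UX scores. The code below is only an illustration under stated assumptions: the judge names and score lists are made-up toy data, not results from the paper, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Build a symmetric judge-by-judge correlation table (heatmap cells)."""
    names = list(scores)
    # Start with 1.0 everywhere; off-diagonal cells are overwritten below.
    matrix = {a: {b: 1.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        r = pearson(scores[a], scores[b])
        matrix[a][b] = matrix[b][a] = r
    return matrix


# Toy scores for three hypothetical judges over the same five trajectories.
toy_scores = {
    "judge_a": [3.0, 4.0, 2.0, 5.0, 4.0],
    "judge_b": [2.0, 4.0, 3.0, 5.0, 3.0],
    "judge_c": [4.0, 3.0, 2.0, 4.0, 5.0],
}
toy_matrix = correlation_matrix(toy_scores)
```

Each off-diagonal entry of `toy_matrix` corresponds to one heatmap cell; the diagonal is fixed at 1.0 because a judge correlates perfectly with itself.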