Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

TL;DR

This paper tackles robustness gaps in conversational AI by introducing TraitBasis, an activation-space direction method that induces high-fidelity, controllable human traits at inference without fine-tuning. By extending the $τ$-Bench benchmark to $τ$-Trait across four domains, it demonstrates meaningful performance degradations of frontier models under realistic trait perturbations and shows TraitBasis outperforms prompt-based and fine-tuning baselines in realism, fidelity, stability, and compositionality. The work provides both a practical tool for simulation-driven testing and a pathway toward more reliable AI agents in unpredictable human interactions, with open-source resources across multiple domains. Collectively, it highlights the importance of behaviorally diverse stress testing and offers a scalable, data-efficient approach to simulate and study user personas in multi-turn dialogues.

Abstract

Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $τ$-Bench to $τ$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $τ$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $τ$-Trait across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.
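
To make "directions in activation space" that can be "scaled" and "composed" more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not say how the directions are estimated, so this sketch assumes a common difference-of-means estimator over contrastive activations. The names `trait_direction` and `steer`, the per-trait strengths, and the toy 3-dimensional vectors are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_direction(with_trait: Sequence[Sequence[float]],
                    without_trait: Sequence[Sequence[float]]) -> Vector:
    """Assumed estimator (not confirmed by the source): the unit-length
    difference between mean activations with and without the trait."""
    pos = mean_vector(with_trait)
    neg = mean_vector(without_trait)
    diff = [p - q for p, q in zip(pos, neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("contrastive sets have identical mean activations")
    return [d / norm for d in diff]


def steer(hidden: Sequence[float],
          directions: Dict[str, Vector],
          strengths: Dict[str, float]) -> Vector:
    """Add each selected trait direction, scaled by its strength, to one
    hidden state. Composition is a plain weighted sum in this sketch."""
    out = list(hidden)
    for name, alpha in strengths.items():
        out = [h + alpha * d for h, d in zip(out, directions[name])]
    return out


# Toy activations standing in for real model hidden states.
directions = {
    "impatience": trait_direction([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
                                  [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]),
    "skepticism": trait_direction([[0.0, 4.0, 1.0]], [[0.0, 1.0, 1.0]]),
}
steered = steer([0.5, 0.5, 0.5], directions,
                {"impatience": 2.0, "skepticism": 0.5})
print(steered)
```

In this toy setup the strength acts as the intensity knob, and passing two traits at once illustrates composition; a real system would apply the same addition to a chosen layer's hidden states at inference time, leaving the model weights unchanged.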

Paper Structure

This paper contains 35 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of our approach and comparison with prompt-based tuning. Trait prompt $P_t$ is generated using contrastive conversations, where one dialogue exhibits the target trait while the other does not. Comparison between TraitBasis and prompt-based tuning: when simulating a user with a specific trait, prompt-based tuning fails to complete the task as the simulated user behavior becomes more realistic, while TraitBasis (generated using a combination of $P_t$'s as shown in Section \ref{sec:method}) remains robust.
  • Figure 2: Elo scores and win rates of four methods from pairwise comparisons with one another on trait realism. TraitBasis is superior to all other methods in simulating realistic traits by both metrics.
  • Figure 3: Comparison of rollouts between $\tau$-Bench and $\tau$-Trait. The users in $\tau$-Trait are steered using TraitBasis, which makes them exhibit traits strongly and stress-test the agent thoroughly.
  • Figure 4: Per-Trait Stability Breakdown. In each plot, methods are ordered left-to-right by their consistency rate, making it a direct visual ranking of stability. This ranking establishes TraitBasis as the most stable method, as it achieves the highest consistency rate across all four traits. Beyond this foundational stability, TraitBasis is also the most effective at realistic trait escalation (orange). In sharp contrast, the baselines on the right, particularly the Prompt and LoRA baselines, are defined by their instability, with bars almost entirely consumed by trait fading (gray).
  • Figure 5: Compositional Accuracy. The plot shows two key metrics: partial-match accuracy (at least one of the two traits identified correctly) and exact-match accuracy (both traits identified correctly). The difference between these two accuracies quantifies the trait-blending gap, representing cases where one of the two traits dominated. The small gap for TraitBasis (17.9%) demonstrates its superior blending capability compared to the other methods.
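
The partial-match and exact-match metrics in the Figure 5 caption, and the blending gap between them, can be sketched as follows. This is an illustrative reading of the caption only: the function name `compositional_accuracy`, the representation of traits as sets of labels, and the sample data are assumptions, and the paper's own evaluation code may differ.

```python
from typing import Iterable, Set, Tuple


def compositional_accuracy(
    samples: Iterable[Tuple[Set[str], Set[str]]]
) -> Tuple[float, float, float]:
    """Return (partial_match, exact_match, blending_gap).

    Each sample pairs the traits that were composed (target) with the
    traits a judge identified (predicted). Assumption: "exact" means every
    target trait was identified; "partial" means at least one was.
    """
    total = partial = exact = 0
    for target, predicted in samples:
        total += 1
        hits = len(target & predicted)
        if hits >= 1:
            partial += 1
        if hits == len(target):
            exact += 1
    if total == 0:
        raise ValueError("no samples given")
    partial_acc = partial / total
    exact_acc = exact / total
    # The gap counts cases where only one of the composed traits came through.
    return partial_acc, exact_acc, partial_acc - exact_acc


# Hypothetical judgments for four two-trait conversations.
samples = [
    ({"impatience", "skepticism"}, {"impatience", "skepticism"}),  # both found
    ({"impatience", "confusion"}, {"impatience"}),                 # one dominated
    ({"skepticism", "incoherence"}, set()),                        # neither found
    ({"confusion", "incoherence"}, {"confusion", "incoherence"}),  # both found
]
print(compositional_accuracy(samples))
```

A smaller gap means fewer conversations in which one trait drowned out the other, which is the sense in which the caption reads a small gap as better blending.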