How can we assess human-agent interactions? Case studies in software agent design

Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

TL;DR

This work tackles the challenge of evaluating human–agent interactions for LLM-powered software agents, arguing that traditional benchmarks assuming full automation miss critical user-centric dynamics. It introduces PULSE (Prediction-powered User Label Synthesis and Evaluation), a three-step framework for efficient human-centric evaluation, and demonstrates its deployment on the OpenHands platform with over 36,000 sessions and 15,000 users. Key findings show that improving the LLM backbone yields larger gains in user satisfaction than scaffolding changes, with planning offering more modest benefits and memory management offering cost savings. Importantly, PULSE reduces confidence interval widths by about 39.5%, enabling more conclusive inferences, and reveals substantial divergence between in-the-wild results and benchmark performance. The work provides practical guidance for designing interactive software agents and offers open-source tooling to extend human-in-the-loop evaluations across domains.

Abstract

LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy the framework on a large-scale web platform built around the open-source software agent OpenHands, collecting in-the-wild usage data across over 15k users. We conduct case studies around how three agent design decisions -- choice of LLM backbone, planning strategy, and memory mechanisms -- impact developer satisfaction rates, yielding practical insights for software agent design. We also show how our framework can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between results comparing claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our findings provide guidance for evaluations of LLM agents with humans and identify opportunities for better agent designs.
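
To make the third step concrete (combining human satisfaction ratings with model-generated pseudo-labels), the following is a minimal sketch of the standard prediction-powered mean estimator. It is an illustration under stated assumptions, not necessarily the authors' exact procedure: the function names `human_only_ci` and `ppi_mean_ci` are hypothetical, the data are synthetic, and the predictor is simulated as the true rating plus Gaussian noise.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def human_only_ci(ratings, alpha=0.05):
    """Classical normal-approximation CI for the mean of human ratings."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    mean = fmean(ratings)
    half = z * math.sqrt(variance(ratings) / len(ratings))
    return mean, mean - half, mean + half


def ppi_mean_ci(human, pred_labeled, pred_unlabeled, alpha=0.05):
    """Prediction-powered estimate of the mean rating.

    human          : human ratings on the labeled sessions
    pred_labeled   : model predictions on those same sessions
    pred_unlabeled : model predictions (pseudo-labels) on unlabeled sessions
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # The "rectifier" corrects the bias of the pseudo-labels using the
    # sessions where both a human rating and a prediction are available.
    residuals = [y - f for y, f in zip(human, pred_labeled)]
    estimate = fmean(pred_unlabeled) + fmean(residuals)
    var = (variance(pred_unlabeled) / len(pred_unlabeled)
           + variance(residuals) / len(residuals))
    half = z * math.sqrt(var)
    return estimate, estimate - half, estimate + half


# Synthetic demo (all numbers below are made up for illustration).
rng = random.Random(0)


def noisy_prediction(rating):
    # Assumed predictor: the true rating plus Gaussian noise.
    return rating + rng.gauss(0, 0.5)


human = [rng.randint(1, 5) for _ in range(150)]          # labeled sessions
pred_labeled = [noisy_prediction(y) for y in human]
hidden = [rng.randint(1, 5) for _ in range(5000)]        # never observed
pred_unlabeled = [noisy_prediction(y) for y in hidden]

h_mean, h_lo, h_hi = human_only_ci(human)
p_mean, p_lo, p_hi = ppi_mean_ci(human, pred_labeled, pred_unlabeled)
print(f"human-only width: {h_hi - h_lo:.3f}")
print(f"PPI width:        {p_hi - p_lo:.3f}")
```

The interval narrows only to the extent that the predictor tracks human ratings: the second variance term depends on the residuals, so a poor predictor yields little or no gain over the human-only interval.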

Paper Structure

This paper contains 31 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: PULSE enables efficient human-centric evaluation of agent designs, yielding insights into how user experience varies with agent design and how it compares to benchmark performance. We instantiate PULSE in a software engineering agent use case and conduct a set of case studies to identify novel insights into designing useful, collaborative agents.
  • Figure 2: Overview of statistics of in-the-wild deployment of our human-agent evaluation framework. Our evaluations spanned over 15k users who were working on a diverse set of problems in terms of programming and natural languages, task category, and interaction style with the agent (as seen through user message count).
  • Figure 3: How do user ratings change based on different software agent designs? We report average user ratings (on a scale of 1-5) estimated from human labels alone and from human labels combined with our framework. For a fair comparison of the effect of PPI (prediction-powered inference), we subsample to the same number of human-labeled data points as in the human-only setting (i.e., 150 per condition). We use * to indicate significant results at a cutoff of $\alpha=0.05$. The LLM backbone makes the biggest difference (case study 1) compared to scaffold changes (case studies 2 and 3).
  • Figure 4: Illustration of the feedback setup: (a) the 5-star feedback interface, and (b) the distribution of ratings collected.
  • Figure 5: Average API cost per step with varying max_step values.