Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications

Negar Arabzadeh; Julia Kiseleva; Qingyun Wu; Chi Wang; Ahmed Awadallah; Victor Dibia; Adam Fourney; Charles Clarke

TL;DR

This work tackles the problem of verifying the practical utility of LLM-powered multi-agent systems for end users. It introduces AgentEval, a two-agent framework in which CriticAgent proposes task-specific criteria and QuantifierAgent quantifies how well a given solution satisfies them, producing a task utility score U_t(s). The approach is demonstrated on Math Problem Solving and ALFWorld household tasks, showing that distinct solutions can be differentiated by criterion-based utility, and that robustness analyses reveal varying stability across criteria and tasks. The framework offers a scalable, adaptable method for ongoing alignment between agent behavior and user needs, with quantified insights into where improvements are most impactful.

Abstract

The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents to assist humans in their daily tasks. However, a significant gap remains in assessing whether LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the pressing need for methods to verify the utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application. This allows for a comprehensive assessment that quantifies the utility of an application against the suggested criteria. AgentEval provides an implementation for the math problems. We also present a comprehensive analysis of the robustness of the quantifier's work.
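The utility definition U_t(s) = {Q_i(s|c_i)} for i = 1..n can be illustrated with a small sketch. This is not the authors' implementation: the `Criterion` class, the `task_utility` function, the criterion names, and the rule-based `toy_quantifier` (a stand-in for the LLM-backed QuantifierAgent) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    """One criterion c_i together with its suggested values {omega_j}."""
    name: str
    accepted_values: List[str]


# A quantifier maps (solution, criterion) to one of the criterion's values.
Quantifier = Callable[[str, Criterion], str]


def task_utility(solution: str,
                 criteria: List[Criterion],
                 quantify: Quantifier) -> Dict[str, str]:
    """Return U_t(s) as a mapping from criterion name to quantified value."""
    utility: Dict[str, str] = {}
    for criterion in criteria:
        value = quantify(solution, criterion)
        # Reject values outside the set suggested for this criterion.
        if value not in criterion.accepted_values:
            raise ValueError(f"{value!r} is not valid for {criterion.name}")
        utility[criterion.name] = value
    return utility


def toy_quantifier(solution: str, criterion: Criterion) -> str:
    """Hypothetical rule-based stand-in for an LLM-backed quantifier."""
    if criterion.name == "clarity":
        return "clear" if "because" in solution else "unclear"
    if criterion.name == "efficiency":
        return "efficient" if len(solution.split()) <= 20 else "inefficient"
    return criterion.accepted_values[0]


# Hypothetical criteria; in AgentEval these would come from CriticAgent.
criteria = [
    Criterion("clarity", ["unclear", "clear"]),
    Criterion("efficiency", ["inefficient", "efficient"]),
]

print(task_utility("x = 4 because 2x = 8", criteria, toy_quantifier))
```

Keeping the quantifier as a plain callable means the same `task_utility` loop would work whether the per-criterion judgment comes from a rule, a human annotator, or a model call.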

Paper Structure (25 sections, 10 figures, 3 tables)

Introduction
Related Work
LLM evaluation
User satisfaction prediction
Using LLMs as evaluators
Defining Task Utility
Datasets and Solutions
MATH Problem Solving
ALFWorld Household Task
AgentEval Workflow
AgentEval for Math Problems
Critic and Quantifier Findings
AgentEval for AlfWorld
Critic and Quantifier Finding
AgentEval Robustness Analysis and In-depth Discussion
...and 10 more sections

Figures (10)

Figure 1: An overview of the AgentEval framework, which consists of two main components: (C) CriticAgent, which learns a list of $n$ criteria ($C=\{c_1,\dots,c_n\}$) and suggested values for each criterion ($c_i: \{\omega_j\}_{j=1}^m$, where $m$ is the number of suggested values), applicable to an arbitrary application that can be assessed by a domain expert; and (Q) QuantifierAgent, which verifies a set of suggested criteria for a considered application and suggests a task utility for an end user ($U_t(s)=\{Q_i(s|c_i)\}_{i=1}^n$).
Figure 2: The taxonomy of task assessments based on the existence of optimal solutions.
Figure 3: (a) AgentEval assessment of three different solutions on the math problem solving task. (b) The same assessment categorized by successful and failed cases.
Figure 4: (a) AgentEval assessment of three different solutions on the ALFWorld household task. (b) The same assessment categorized by successful and failed cases.
Figure 5: Task-based criteria vs. solution-based criteria for math problems. The 95% interval is shown at each step.
...and 5 more figures
