Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications

Negar Arabzadeh, Julia Kiseleva, Qingyun Wu, Chi Wang, Ahmed Awadallah, Victor Dibia, Adam Fourney, Charles Clarke

TL;DR

This work tackles the problem of verifying the practical utility of LLM-powered multi-agent systems for end users. It introduces AgentEval, a two-agent framework in which CriticAgent proposes task-specific criteria and QuantifierAgent quantifies how well a given solution satisfies them, producing a task utility score U_t(s). The approach is demonstrated on Math Problem Solving and ALFWorld household tasks, showing that distinct solutions can be differentiated by criterion-based utility, and that robustness analyses reveal varying stability across criteria and tasks. The framework offers a scalable, adaptable method for ongoing alignment between agent behavior and user needs, with quantified insights into where improvements are most impactful.
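As a rough illustration of the criteria-then-quantification flow described above, the following Python sketch represents each criterion with its accepted values and collects one value per criterion into a utility mapping that mirrors U_t(s). It is a minimal sketch, not AgentEval's implementation: the `Criterion` class, the `task_utility` function, the criterion names, and the rule-based `toy_quantifier` are all hypothetical, and the toy quantifier merely stands in for the LLM-backed QuantifierAgent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    """One criterion c_i with its accepted values (hypothetical structure)."""
    name: str
    description: str
    accepted_values: List[str]


# Hypothetical stand-in for QuantifierAgent: given a solution and a criterion,
# it returns one of that criterion's accepted values.
Quantifier = Callable[[str, Criterion], str]


def task_utility(solution: str, criteria: List[Criterion],
                 quantify: Quantifier) -> Dict[str, str]:
    """Collect one quantified value per criterion, mirroring U_t(s)."""
    utility: Dict[str, str] = {}
    for criterion in criteria:
        value = quantify(solution, criterion)
        # Reject values outside the set proposed for this criterion.
        if value not in criterion.accepted_values:
            raise ValueError(f"{value!r} is not accepted for {criterion.name}")
        utility[criterion.name] = value
    return utility


def toy_quantifier(solution: str, criterion: Criterion) -> str:
    """Rule-based placeholder; a real quantifier would query an LLM."""
    if criterion.name == "conciseness":
        return "concise" if len(solution.split()) <= 20 else "verbose"
    if criterion.name == "shows_steps":
        return "yes" if "=" in solution else "no"
    return criterion.accepted_values[0]


# Illustrative criteria; these are assumptions, not the ones CriticAgent produced.
criteria = [
    Criterion("conciseness", "Is the solution free of padding?",
              ["verbose", "concise"]),
    Criterion("shows_steps", "Does the solution show intermediate steps?",
              ["no", "yes"]),
]

solution = "2x + 3 = 7, so 2x = 4 and x = 2."
scores = task_utility(solution, criteria, toy_quantifier)
print(scores)
```

Keeping the quantifier as a plain callable means the rule-based placeholder could be swapped for a model-backed one without changing how the per-criterion values are collected.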

Abstract

The rapid development in the field of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents to assist humans in their daily tasks. However, a significant gap remains in assessing whether LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the pressing need for methods to verify the utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application (AgentEval provides an implementation for the math problems). This allows for a comprehensive assessment, quantifying the utility of an application against the suggested criteria. We present a comprehensive analysis of the robustness of the quantifier's work.

Paper Structure (25 sections, 10 figures, 3 tables)

Figures (10)

  • Figure 1: An overview of the AgentEval framework, which consists of two main components: (C) CriticAgent, which learns a list of $n$ criteria ($C=\{c_1,\dots, c_n\}$) and suggested values for each criterion ($c_i: \{\omega_j\}_{j=1}^m$), where $m$ is the number of suggested values, applicable to an arbitrary application that can be assessed by a domain expert; and (Q) QuantifierAgent, which verifies a set of suggested criteria for a considered application and suggests a task utility for an end-user ($U_t(s)=\{Q_i(s|c_i)\}_{i=1}^n$).
  • Figure 2: The taxonomy of task assessments based on optimal solutions existence
  • Figure 3: (a) AgentEval assessment of three different solutions on the math problem solving task. (b) The same assessment categorized by successful and failed cases.
  • Figure 4: (a) AgentEval assessment of three different solutions on the ALFWorld household task. (b) The same assessment categorized by successful and failed cases.
  • Figure 5: Task-based vs. solution-based criteria for math problems. The 95% interval is shown at each step.
  • ...and 5 more figures