Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications
Negar Arabzadeh, Julia Kiseleva, Qingyun Wu, Chi Wang, Ahmed Awadallah, Victor Dibia, Adam Fourney, Charles Clarke
TL;DR
This work tackles the problem of verifying the practical utility of LLM-powered multi-agent systems for end users. It introduces AgentEval, a two-agent framework in which CriticAgent proposes task-specific criteria and QuantifierAgent quantifies how well a given solution satisfies them, producing a task utility score U_t(s). The approach is demonstrated on Math Problem Solving and ALFWorld household tasks, where criterion-based utility differentiates distinct solutions and robustness analyses show that stability varies across criteria and tasks. The framework offers a scalable, adaptable method for ongoing alignment between agent behavior and user needs, with quantified insights into where improvements are most impactful.
Abstract
The rapid development in the field of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents to assist humans in their daily tasks. However, a significant gap remains in assessing whether LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the pressing need for methods to verify the utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application. This allows for a comprehensive assessment, quantifying the utility of an application against the suggested criteria. We present a comprehensive analysis of the robustness of the quantifier's work. An implementation of AgentEval for the math problems is provided.
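The two-step flow described above (criteria proposed first, then a solution quantified against them) can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the names `Criterion`, `stub_critic`, `stub_quantifier`, and `task_utility` are hypothetical, the two stub functions stand in for LLM-backed agents with fixed toy rules, and representing the utility as a dictionary from criterion name to assessed value is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Criterion:
    """One evaluation criterion with its allowed values (hypothetical structure)."""
    name: str
    description: str
    accepted_values: Tuple[str, ...]


def stub_critic(task: str) -> List[Criterion]:
    """Hypothetical stand-in for a criteria-proposing agent.

    A real critic would query an LLM with the task description;
    this stub returns a fixed list so the sketch runs offline.
    """
    return [
        Criterion("accuracy", "Is the final answer correct?", ("incorrect", "correct")),
        Criterion("clarity", "Is the reasoning easy to follow?", ("poor", "good")),
    ]


def stub_quantifier(task: str, criteria: List[Criterion], solution: str) -> Dict[str, str]:
    """Hypothetical stand-in for a quantifying agent using toy rules."""
    ratings: Dict[str, str] = {}
    for criterion in criteria:
        if criterion.name == "accuracy":
            # Toy rule: the expected answer for the demo task is hard-coded.
            ratings[criterion.name] = "correct" if "42" in solution else "incorrect"
        else:
            # Toy rule: longer explanations are treated as clearer.
            ratings[criterion.name] = "good" if len(solution.split()) >= 5 else "poor"
    return ratings


def task_utility(
    task: str,
    solution: str,
    critic: Callable[[str], List[Criterion]] = stub_critic,
    quantifier: Callable[[str, List[Criterion], str], Dict[str, str]] = stub_quantifier,
) -> Dict[str, str]:
    """Propose criteria for the task, then rate the solution on each one."""
    criteria = critic(task)
    return quantifier(task, criteria, solution)


if __name__ == "__main__":
    task = "What is 6 * 7?"
    print(task_utility(task, "6 times 7 equals 42 exactly"))
    print(task_utility(task, "41"))
```

Passing the critic and quantifier as callables keeps the sketch independent of any particular LLM client; swapping the stubs for model-backed functions would not change `task_utility` itself.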
