VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures

Yoo Yeon Sung, Hannah Kim, Dan Zhang

TL;DR

VeriLA addresses the challenge of interpreting and auditing failures in LLM-based agent systems by introducing a human-centered evaluation framework that stages task solving into planning, execution, and verification. It leverages a graph-based plan with a human-designed agent registry, a separate human-aligned verifier trained on ground-truth labels, and aggregation metrics to predict overall task failure, all augmented with uncertainty and plan-structure features. A mathematical reasoning case study across four datasets demonstrates high verifier accuracy and actionable insights for diagnosing failure propagation and guiding revisions. The framework enhances transparency, accountability, and efficiency in human-in-the-loop AI systems and is poised for expansion to broader domains such as open-domain QA and fact-checking.

Abstract

AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks, yet these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these failures interpretable to humans. The framework first defines clear expectations for each agent by curating human-designed agent criteria. Then, it develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures against a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.

Paper Structure

This paper contains 33 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of VeriLA. Our framework operates in three main stages: (1) planning, where a planning agent decomposes a task into subtasks using a human-designed agent registry and generates a plan graph; (2) agent execution, where specialized LLM agents perform the subtasks; and (3) execution verification, which verifies each LLM agent's outputs based on human-defined agent criteria, agent uncertainty, and dependency information from the plan structure. We then assess task failure with aggregation metrics that combine verifier scores. Our framework guides users to detect task failures efficiently, identify faulty agents, and analyze the root causes of their failure.
  • Figure 2: Example of an agent's failure propagating to overall task failure. Given the plan generated by the planning agent, each agent should accurately execute its subtask. Here, the first "subtract" agent fails to calculate the remaining eggs, so the subsequent "subtract" and "multiply" agents lack the context needed for a successful execution (three red boxes). An agent-specific verifier can help users trace the error propagation, identify the root cause of the error, and understand how it led to the task failure.
  • Figure 3: Verifier accuracy across datasets. The test accuracy remains consistently high across subtasks, without bias toward any specific one. Similar subtasks, like "Add" and "Subtract," which share the same criteria, also show comparable accuracies across all datasets.
  • Figure 4: Ablation study of verifier test accuracy under different feature configurations. The human-defined agent criteria feature improves verifier performance, and accuracy is highest when all features are used.
  • Figure 5: Aggregation performance measured by failure rate across aggregation score percentiles. All metrics show an upward trend, suggesting that they can help users prioritize the tasks most likely to fail and, when the labor budget is limited, audit high-risk tasks first. Overall, mean and outdegree show stable performance across datasets and can serve as default aggregation metrics for new datasets.
  • ...and 2 more figures
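
The aggregation step described in the Figure 1 and Figure 5 captions can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each agent's verifier returns a failure probability in [0, 1], that the "mean" metric is a plain average of those scores, and that the "outdegree" metric weights each agent by one plus the number of downstream agents that depend on it in the plan graph. The paper's exact definitions may differ. The function names, the plan graph, and the scores are all hypothetical, loosely modeled on the Figure 2 example.

```python
from typing import Dict, List


def mean_score(fail_prob: Dict[str, float]) -> float:
    """Unweighted mean of per-agent failure probabilities (assumed 'mean' metric)."""
    return sum(fail_prob.values()) / len(fail_prob)


def outdegree_score(fail_prob: Dict[str, float],
                    children: Dict[str, List[str]]) -> float:
    """Weighted mean where each agent's weight is 1 + its out-degree.

    Assumption: agents with more dependents matter more, because their
    failure can propagate to every downstream agent.
    """
    weights = {a: 1 + len(children.get(a, [])) for a in fail_prob}
    total = sum(weights.values())
    return sum(weights[a] * fail_prob[a] for a in fail_prob) / total


def rank_for_audit(task_scores: Dict[str, float]) -> List[str]:
    """Order task ids from highest to lowest aggregated failure score."""
    return sorted(task_scores, key=task_scores.get, reverse=True)


# Hypothetical plan graph: agent -> agents that consume its output.
plan = {"subtract_1": ["subtract_2"], "subtract_2": ["multiply_1"], "multiply_1": []}
# Hypothetical verifier outputs (probability that each agent failed).
scores = {"subtract_1": 0.9, "subtract_2": 0.6, "multiply_1": 0.3}

task_mean = mean_score(scores)
task_outdegree = outdegree_score(scores, plan)
audit_order = rank_for_audit({"task_a": task_outdegree, "task_b": 0.2})
```

Under these assumptions, the out-degree weighting pushes the task score above the plain mean whenever the likely failures sit upstream in the plan, which matches the intuition in Figure 2 that an early failure contaminates later agents. Sorting tasks by either score gives an audit queue for a limited labor budget.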