Where LLM Agents Fail and How They can Learn From Failures

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, Jiaxuan You

TL;DR

This paper identifies error propagation as the central bottleneck in LLM-based agents and introduces a principled three-part solution: AgentErrorTaxonomy for modular failure types, AgentErrorBench as a real-trajectory annotated dataset, and AgentDebug as a three-stage debugging framework that localizes root causes, uses counterfactuals, and provides targeted feedback to enable iterative recovery. Empirical results show that AgentDebug outperforms strong baselines in critical error detection (e.g., All-Correct accuracy improving from 0.3% to 24.3% and Step accuracy from 28.0% to 45.0%), and yields up to 26% relative gains in task success across ALFWorld, GAIA, and WebShop. The work demonstrates that focusing on root-cause errors and actionable remediation can significantly enhance robustness and adaptability of LLM agents across embodied reasoning, web interaction, and decision-making domains. Collectively, these contributions establish principled debugging as a viable path to more reliable, self-improving LLM agents.
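The detection numbers above are absolute differences, while the task-success figure is stated as a relative gain. The short sketch below shows both calculations. The detection inputs are the values quoted in the TL;DR; the task-success pair (50.0 and 63.0) is a made-up illustration, not a result from the paper, and the function names are hypothetical.

```python
def point_gain(baseline: float, new: float) -> float:
    """Absolute gain in percentage points."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - baseline) / baseline * 100.0


# Detection accuracies quoted in the TL;DR (baseline -> AgentDebug).
all_correct_gain = round(point_gain(0.3, 24.3), 1)
step_gain = round(point_gain(28.0, 45.0), 1)

# Hypothetical task-success rates, chosen only to illustrate a 26% relative gain.
example_relative = round(relative_gain(50.0, 63.0), 1)

print(all_correct_gain, step_gain, example_relative)
```

The two point gains, 24.0 and 17.0, match the "24% higher" and "17% higher" figures given in the Abstract, which suggests those are percentage-point differences rather than relative gains.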

Abstract

Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
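As a rough illustration of the ideas in the abstract, the sketch below models the five taxonomy modules as an enum and locates a root-cause error by a counterfactual check: the earliest flagged step whose correction would turn the failed run into a success. This is a minimal sketch under stated assumptions, not the authors' implementation. `StepError`, `find_critical_error`, and the `succeeds_if_fixed` callback are hypothetical names; in a real system the callback would have to re-run the agent with the step corrected.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class Module(Enum):
    # The five module categories named in the abstract.
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"


@dataclass(frozen=True)
class StepError:
    step: int        # 1-based step index in the trajectory
    module: Module   # module in which the error was flagged
    note: str        # free-text description (hypothetical field)


def find_critical_error(
    errors: Sequence[StepError],
    succeeds_if_fixed: Callable[[StepError], bool],
) -> Optional[StepError]:
    """Return the earliest flagged error whose correction flips the outcome.

    Assumption: `succeeds_if_fixed` stands in for a counterfactual re-rollout.
    """
    for err in sorted(errors, key=lambda e: e.step):
        if succeeds_if_fixed(err):
            return err
    return None


# Toy example: only fixing the planning error at step 2 rescues the task.
errors = [
    StepError(5, Module.ACTION, "clicked the wrong item"),
    StepError(2, Module.PLANNING, "chose an infeasible subgoal"),
    StepError(1, Module.MEMORY, "harmless recall slip"),
]
critical = find_critical_error(errors, lambda e: e.step == 2)
print(critical)
```

Scanning in step order reflects the cascading-failure framing: later errors are often consequences of an earlier one, so the earliest error that changes the outcome when fixed is the natural candidate for the root cause.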

Paper Structure

This paper contains 28 sections, 26 figures, 2 tables, 1 algorithm.

Figures (26)

  • Figure 1: Motivation for AgentDebug: A single root-cause failure (b) can propagate through subsequent steps (c), compounding errors and leading to task failure. AgentDebug (d) addresses this bottleneck by tracing failures back to their source and providing actionable feedback that enables agents to evolve into more robust versions.
  • Figure 2: Pipeline of proposed AgentErrorTaxonomy and AgentErrorBench. Failed trajectories are collected, analyzed to develop a taxonomy of errors, and then annotated with root causes and actionable feedback to form the benchmark.
  • Figure 3: Distribution of failure cases in LLM agents on AgentErrorBench.
  • Figure 4: Overview of AgentDebug. (Left) LLM agent rollouts alternate between memory, planning, reflection, and action. (Right) AgentDebug debugs trajectories in three stages: (1) fine-grained analysis across steps and modules, (2) detection of the critical error that triggers failure, and (3) iterative re-rollouts with actionable feedback to turn failures into successes.
  • Figure 5: Downstream debugging performance on ALFWorld. Results are shown across three backbone models (GPT-4o-mini, Qwen3-8B, Qwen3-Next-80B) and different methods. AgentDebug consistently outperforms strong baselines.
  • ...and 21 more figures