Demystifying the Lifecycle of Failures in Platform-Orchestrated Agentic Workflows

Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang

TL;DR

This paper presents AgentFail, a dataset of 307 real-world failure cases collected from two representative agentic workflow platforms, and analyzes failure patterns, root causes, and how repair difficulty varies across root causes and workflow nodes.

Abstract

Agentic workflows built on low-code orchestration platforms enable rapid development of multi-agent systems, but they also introduce new and poorly understood failure modes that hinder reliability and maintainability. Unlike traditional software systems, failures in agentic workflows often propagate across heterogeneous nodes through natural-language interactions, tool invocations, and dynamic control logic, making failure attribution and repair particularly challenging. In this paper, we present an empirical study of platform-orchestrated agentic workflows from a failure lifecycle perspective, with the goal of characterizing failure manifestations, identifying underlying root causes, and examining corresponding repair strategies. We present AgentFail, a dataset of 307 real-world failure cases collected from two representative agentic workflow platforms. Based on this dataset, we analyze failure patterns, root causes, and how repair difficulty varies across root causes and workflow nodes. Our findings reveal key failure mechanisms in agentic workflows and provide actionable guidelines for reliable failure repair and real-world agentic workflow design.

Paper Structure

This paper contains 19 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Example of failure lifecycle in agentic workflow.
  • Figure 2: Statistic of Distribution. In (c), each concentric ring indicates the number of failure cases.
  • Figure 3: The process of failure attribution.
  • Figure 4: Failure Root Cause and Repair Strategy Taxonomy.
  • Figure 5: Examples of failure and repair.
  • ...and 2 more figures