Failure-Driven Workflow Refinement

Jusheng Zhang, Kaitong Cai, Qinglin Zeng, Ningyuan Liu, Stephen Fan, Ziliang Chen, Keze Wang

TL;DR

This work reframes LLM workflow optimization as minimizing the Expected Failure Mass within a Failure Signature Space, preserving the structure of failures rather than collapsing them to binary success/failure signals. It introduces CE-Graph, a failure-driven refinement framework that clusters recurring failure modes from counterexamples and applies constrained graph edits via a Propose-and-Verify loop to greedily reduce failure mass. Empirically, CE-Graph delivers higher robustness at lower optimization cost across math, code, and QA benchmarks, and it transfers across models and datasets. The results suggest that reliability emerges from learning and reshaping the geometry of failure distributions rather than from simply avoiding failures, although the approach relies on detectable failure modes and incurs verification costs. Limitations include dependence on meaningful failure clustering and the need for carefully designed operator libraries; future work focuses on adaptive embeddings and risk-aware objectives.

Abstract

Optimizing LLM-based workflows is typically formulated as a global search, where candidate workflows are evaluated based on a scalar metric. This paradigm, however, suffers from a critical flaw: information collapse. By reducing rich, multi-step execution traces to simple success/failure signals, existing methods are rendered blind to the underlying structure of failures, fundamentally preventing them from modeling the workflow's failure distribution. We reconceptualize this challenge as a distributional problem. We propose a new paradigm where the optimization goal is not to maximize a scalar score, but to directly minimize a workflow's Expected Failure Mass, i.e., the integral of its failure probability density function defined over a high-dimensional Failure Signature Space (FSS). This distributional lens allows us to move from inefficient, zero-order optimization to a principled, gradient-like descent on the failure landscape itself. We introduce CE-Graph, a framework that operationalizes this paradigm through a novel, failure-driven refinement process. CE-Graph approximates the failure distribution from a pool of counterexamples, identifies its densest regions as recurring failure modes, and applies targeted, operator-constrained graph edits via a Propose-and-Verify mechanism to greedily reduce the failure mass. On math, code, and QA benchmarks, our CE-Graph achieves higher robustness at a significantly lower cost than strong baselines. This suggests that a system's reliability emerges not from avoiding failures, but from systematically learning and reshaping the geometric structure of its failure distributions.
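The refinement loop described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and not the paper's implementation: `failure_mass`, `refine_once`, and the toy workflow are hypothetical, a hashable label stands in for a cluster in the Failure Signature Space (the paper clusters signature vectors, which is not reproduced), and verification simply re-measures the empirical failure mass on the same task pool.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, Optional, Sequence, TypeVar

W = TypeVar("W")      # workflow type (hypothetical; any object works)
Signature = Hashable  # stand-in for a failure-mode cluster id

RunFn = Callable[[W, str], Optional[Signature]]   # None means the task succeeded
ProposeFn = Callable[[W, Signature], Iterable[W]]  # candidate edits aimed at one mode


def failure_mass(workflow: W, tasks: Sequence[str], run: RunFn) -> Dict[Signature, float]:
    """Empirical failure mass per mode: fraction of tasks failing with each signature."""
    if not tasks:
        return {}
    counts = Counter(
        sig for sig in (run(workflow, task) for task in tasks) if sig is not None
    )
    return {mode: count / len(tasks) for mode, count in counts.items()}


def refine_once(workflow: W, tasks: Sequence[str], run: RunFn, propose: ProposeFn) -> W:
    """One greedy Propose-and-Verify step against the densest failure mode."""
    mass = failure_mass(workflow, tasks, run)
    if not mass:
        return workflow  # nothing left to fix on this task pool
    target = max(mass, key=mass.get)  # densest mode
    best, best_total = workflow, sum(mass.values())
    for candidate in propose(workflow, target):
        # Verify: accept a candidate only if total failure mass strictly drops.
        total = sum(failure_mass(candidate, tasks, run).values())
        if total < best_total:
            best, best_total = candidate, total
    return best


# Toy setup (entirely hypothetical): a workflow is a set of installed fixes,
# and a task fails with its own kind as signature unless the fix is present.
TASKS = ["arith", "arith", "arith", "format", "ok"]


def toy_run(workflow: frozenset, task: str) -> Optional[str]:
    if task == "ok" or f"fix_{task}" in workflow:
        return None
    return task


def toy_propose(workflow: frozenset, mode: str) -> Iterable[frozenset]:
    yield workflow | {f"fix_{mode}"}


if __name__ == "__main__":
    w = frozenset()
    for _ in range(2):
        w = refine_once(w, TASKS, toy_run, toy_propose)
        print(sorted(w), failure_mass(w, TASKS, toy_run))
```

In this toy, the first round targets the `arith` mode because it holds the most failure mass, and the second round targets `format`. A real system would replace the label lookup with clustering over signature embeddings and would verify candidates on held-out counterexamples.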

Paper Structure

This paper contains 61 sections, 1 theorem, 22 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Let $\Delta_t$ be selected by the greedy mass-reduction rule (Eq. eq:greedy_mass_reduction). If the edit reduces the mass in the target mode by at least $\delta > 0$, then $M(W_{t+1}) \leq M(W_t) - \delta + \epsilon$, where $\epsilon = O(L \cdot B \cdot \mu(\mathcal{F} \setminus b_t^*))$ bounds spillover effects to non-target re
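
The one-step bound can be chained across rounds. As a sketch, assume (this is not stated in the source) that the same $\delta$ and $\epsilon$ hold at each of $T$ consecutive rounds; summing the inequality $M(W_{t+1}) \leq M(W_t) - \delta + \epsilon$ over $t = 0, \dots, T-1$ then gives:

```latex
M(W_T) \;\le\; M(W_0) - \sum_{t=0}^{T-1} (\delta - \epsilon) \;=\; M(W_0) - T\,(\delta - \epsilon)
```

Under that assumption, the bound guarantees a net decrease in failure mass only when the spillover term is smaller than the targeted reduction, i.e. $\epsilon < \delta$.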

Figures (8)

  • Figure 1: Upper: traditional LLMs compress rich error traces into binary signals, causing "information collapse" and obscuring the underlying failure distribution, which hinders systematic refinement. Lower: CE-Graph leverages the Failure Signature Space to cluster errors into coherent modes
  • Figure 2: Overview of our CE-Graph framework. The process iteratively refines workflows by (i) distilling raw failure traces into structured signatures, (ii) clustering to expose dense failure modes, and (iii) applying targeted Propose-and-Verify graph edits, enabling principled descent on the failure landscape.
  • Figure 3: CE-Graph outperforms all baselines on GAIA (Levels 1–3, Avg.) and achieves the best accuracy–efficiency trade-off (tokens, API cost) in the ideal lower-right region.
  • Figure 4: The refinement accuracy on the fixed failure set ($E_0$) over 20 optimization rounds across three mathematical reasoning benchmarks: GSM8K, MATH, and MultiArith. CE-Graph (orange) demonstrates a smooth and monotonically increasing trajectory on all datasets, highlighting the stability and accumulative effect of its refinement process. In contrast, baseline methods, especially AFlow (purple), exhibit significant performance fluctuations, reflecting instability induced by ad-hoc refinement strategies and policy oscillation.
  • Figure 5: Visualization of the iterative failure-driven refinement cycle in CE-Graph. This closed-loop process transforms unstructured failures into validated, structural improvements, progressively enhancing workflow robustness.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Theorem 1: Greedy Reduction Bound
  • Proof: Proof Sketch
  • Definition 1: Failure Signature Vector