DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows

Yandong Yan, Junwei Peng, Shijie Li, Chenxi Li, Yifei Shang, Can Deng, Ruiting Dai, Yongqiang Zhao, Jiaqi Zhu, Yu Huang

TL;DR

This work formalizes the multi-step reasoning process as a Noisy MDP and proposes DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages. DenoiseFlow achieves the highest accuracy on every benchmark while reducing cost by 40–56% through adaptive branching.

Abstract

Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in natural-language instructions tend to compound silently across steps. We term this failure mode accumulated semantic ambiguity. Existing mitigation approaches often lack runtime adaptivity, relying instead on static exploration budgets, reactive error recovery, or single-path execution that ignores uncertainty entirely. We formalize the multi-step reasoning process as a Noisy MDP and propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages: (1) Sensing estimates per-step semantic uncertainty; (2) Regulating adaptively allocates computation by routing between fast single-path execution and parallel exploration based on estimated risk; and (3) Correcting performs targeted recovery via influence-based root-cause localization. Online self-calibration continuously aligns decision boundaries with verifier feedback, requiring no ground-truth labels. Experiments on six benchmarks spanning mathematical reasoning, code generation, and multi-hop QA show that DenoiseFlow achieves the highest accuracy on every benchmark (83.3% average, +1.3% over the strongest baseline) while reducing cost by 40–56% through adaptive branching. Detailed ablation studies further confirm the framework's robustness and generality. Code is available at https://anonymous.4open.science/r/DenoiseFlow-21D3/.

Paper Structure

This paper contains 65 sections, 1 theorem, 19 equations, 3 figures, 16 tables, 1 algorithm.

Key Result

Proposition 1

Under the Noisy MDP formulation with dual-channel risk propagation (Eq. eq:propagation), the accumulated divergence proxy $\tilde{u}_T$ after $T$ steps is bounded by an expression in the following quantities: $\epsilon_t$, the local noise at step $t$; $w_{\max} = \max_{k,t} w_{kt}$, the maximum coupling coefficient; $\bar{w}$, the average coupling; and the channel weights $\lambda$ and $\beta$.

Figures (3)

  • Figure 1: DenoiseFlow architecture overview. Given a problem, Stage 1 (Sensing) generates $N$ Monte Carlo samples to estimate semantic uncertainty $u_t$ via clustering-based entropy, then propagates risk through the dependency graph to obtain $\tilde{u}_t$. Stage 2 (Regulating) computes execution confidence $c_t = 1 - r_t$ and routes to one of three modes: Direct (high confidence, single path), Branch (medium confidence, $K$ parallel paths with consensus selection), or Refine (low confidence, trigger Stage 3). Stage 3 (Correcting) performs influence-based tracing to localize the root cause $k^*$, applies asymmetric calibration to boost uncertainty at $k^*$, and re-executes from the corrected state. The online calibration module continuously adjusts temperature $T$ based on verifier feedback.
  • Figure 2: Hyperparameter sensitivity analysis across six benchmarks. The dark solid line shows average accuracy; the red $\star$ marks the optimal value. Individual benchmarks are shown as dashed lines. (a) $N{=}5$ achieves the best average; (b) $\tau_{\text{sim}}{=}0.85$ balances sensitivity; (c) $K_{\max}$ significantly impacts QA tasks; (d) accuracy is relatively insensitive to $R$.
  • Figure 3: Uncertainty calibration: risk score $r$ vs. success rate on GSM8K and MATH. The negative correlation confirms rank consistency. Error bars show the standard deviation over three runs. Similar trends hold on other benchmarks (Appendix \ref{app:calibration_quality}).
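
The Sensing and Regulating stages described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a greedy string-similarity clustering (using `difflib.SequenceMatcher` as a stand-in for whatever semantic similarity the paper uses), a normalized Shannon entropy over cluster sizes as the uncertainty $u_t$, and hypothetical confidence thresholds `low` and `high` for the three-way routing. Only $N$, $\tau_{\text{sim}}$, and the relation $c_t = 1 - r_t$ come from the text above; risk propagation through the dependency graph is omitted.

```python
import math
from difflib import SequenceMatcher


def cluster_samples(samples, tau_sim=0.85):
    """Greedy clustering: a sample joins the first cluster whose
    representative (first member) is at least tau_sim similar to it.
    SequenceMatcher is an assumed stand-in for a semantic similarity."""
    clusters = []
    for sample in samples:
        for cluster in clusters:
            if SequenceMatcher(None, sample, cluster[0]).ratio() >= tau_sim:
                cluster.append(sample)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([sample])
    return clusters


def semantic_uncertainty(samples, tau_sim=0.85):
    """Shannon entropy of the cluster-size distribution, normalized by
    log(N) so the result lies in [0, 1]."""
    n = len(samples)
    if n <= 1:
        return 0.0
    probs = [len(c) / n for c in cluster_samples(samples, tau_sim)]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return entropy / math.log(n)


def route(risk, low=0.3, high=0.7):
    """Three-way routing on confidence c = 1 - r.
    The thresholds low and high are hypothetical, not from the paper."""
    confidence = 1.0 - risk
    if confidence >= high:
        return "direct"   # single path
    if confidence >= low:
        return "branch"   # K parallel paths with consensus selection
    return "refine"       # hand over to the Correcting stage


# N = 5 sampled answers for one step; three agree, two disagree.
answers = ["42", "42", "42", "41", "7"]
u = semantic_uncertainty(answers)
mode = route(u)
```

Here the unpropagated uncertainty `u` is used directly as the risk score. In the paper's pipeline the routed quantity is $r_t$, obtained after propagation and calibration, so the sketch shows only the shape of the decision.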

Theorems & Definitions (1)

  • Proposition 1: Propagation Bound