SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents

Alejandro Cuadron, Pengfei Yu, Yang Liu, Arpit Gupta

TL;DR

The paper analyzes long-horizon LLM agents and shows that a minority of mutating actions disproportionately causes task failure. It introduces SABER, a gradient-free, model-agnostic safeguard that gates mutating steps with lightweight verification, provides targeted reasoning through reflection, and controls context via block-based filtering. Using the cleaned τ-Bench Verified and SWE-Bench Verified benchmarks, SABER achieves consistent, substantial gains across open- and closed-weight models, with larger headroom revealed when benchmark flaws are removed. The work advocates action-level analysis and reliable evaluation as essential steps toward robust, real-world multi-turn agents.

Abstract

Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) vs. non-mutating steps and formalize decisive deviations: the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce SABER, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects Targeted Reflection before mutating steps, and (iii) performs block-based context cleaning. SABER delivers consistent gains, e.g., Qwen3-Thinking: +28% relative on Airline, +11% on Retail, and +7% on SWE-Bench Verified; Claude: +9%/+7%. We further identify ceiling effects in τ-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.
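
To make the odds statement in the abstract concrete, the following is a worked reading of a logistic regression under assumed notation. The symbols p (probability of task success), n_m and n_nm (counts of deviations in mutating and non-mutating actions), and the coefficients are illustrative assumptions, not the paper's own specification, which may include other covariates.

```latex
% Assumed model form (illustrative; not taken from the paper):
\[
\log\frac{p}{1-p} = \beta_0 + \beta_{\mathrm{m}}\, n_{\mathrm{m}} + \beta_{\mathrm{nm}}\, n_{\mathrm{nm}}
\]
% One extra mutating deviation multiplies the odds of success by e^{\beta_m}:
\[
\frac{\mathrm{odds}(n_{\mathrm{m}}+1)}{\mathrm{odds}(n_{\mathrm{m}})} = e^{\beta_{\mathrm{m}}}
\]
% A 92% reduction in odds (Airline) and a 96% reduction (Retail) correspond to:
\[
e^{\beta_{\mathrm{m}}} = 1 - 0.92 = 0.08 \;\Rightarrow\; \beta_{\mathrm{m}} = \ln 0.08 \approx -2.53
\]
\[
e^{\beta_{\mathrm{m}}} = 1 - 0.96 = 0.04 \;\Rightarrow\; \beta_{\mathrm{m}} = \ln 0.04 \approx -3.22
\]
```

Under the same reading, "little to no effect" for non-mutating deviations would correspond to an odds ratio close to 1, i.e. a coefficient close to 0 for n_nm.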

Paper Structure

This paper contains 111 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of Targeted Reflection and Mutation-Gated User Verification
  • Figure 2: Runtime workflow of SABER. The pipeline is anchored on mutation-gated human verification: the auxiliary model inspects whether a candidate action $a_t$ is mutating and, if so, reformulates the tool call into natural language and requests user confirmation. To support this gate in long contexts, the auxiliary model (i) injects a distilled reflection of system instructions and tool constraints into a <think> block to reduce invalid mutations, and (ii) applies block-based context filtering to retain only verification-critical and goal-salient history.
  • Figure 3: Two examples of incorrect ground-truth annotations in τ-Bench. (a) In Retail, the solution reuses the same product ID, violating the policy that exchanges require a different option. (b) In Airline, the solution issues a certificate without first confirming and changing/cancelling the reservation.
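
The mutation gate described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption says an auxiliary model decides whether an action is mutating and rewrites the tool call in natural language, whereas this sketch substitutes a fixed lookup set and a string template. The names `MUTATING_TOOLS`, `Action`, `is_mutating`, `describe`, and `gate`, as well as the tool names, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical tool names; a real set would come from the benchmark's tool schema.
MUTATING_TOOLS = {"cancel_reservation", "update_reservation", "exchange_item"}


@dataclass
class Action:
    """A candidate tool call a_t proposed by the agent."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


def is_mutating(action: Action) -> bool:
    # Stand-in for the auxiliary model's check (assumption: simple lookup).
    return action.tool in MUTATING_TOOLS


def describe(action: Action) -> str:
    # Stand-in for reformulating the tool call into natural language.
    args = ", ".join(f"{k}={v!r}" for k, v in sorted(action.args.items()))
    return f"I am about to call {action.tool} with {args}. Do you confirm?"


def gate(action: Action, confirm: Callable[[str], bool]) -> bool:
    """Return True if the action may be executed.

    Non-mutating actions pass through without asking; mutating actions
    are executed only if the user confirms the plain-language description.
    """
    if not is_mutating(action):
        return True
    return confirm(describe(action))
```

In this sketch, `confirm` is any callable that shows the message to the user and returns their decision, so read-only calls never interrupt the user while environment-changing calls always do.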