PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases

Sri Vatsa Vuddanti, Aarav Shah, Satwik Kumar Chittiprolu, Tony Song, Sunishchal Dev, Kevin Zhu, Maheep Chaudhary

TL;DR

Tool-driven agents often fail in real-world deployments due to tool timeouts and API errors. PALADIN presents a unified framework that (i) injects failures and trains on 50K+ recovery-annotated trajectories, supported by a recovery bank of 55+ exemplars, and (ii) augments inference with taxonomy-guided retrieval to enact recovery actions for unseen failures. The approach yields substantial gains on execution robustness metrics, notably raising Recovery Rate from 32.76% to 89.68% and improving Task Success Rate and Catastrophic Success Rate across multiple backbones, while maintaining reasonable efficiency. PALADIN’s recovery supervision and retrieval-based recovery indicate that robust, fault-tolerant tool use is a learnable, transferable capability, enabling safer, more reliable real-world deployments of tool-augmented LLM agents.

Abstract

Tool-augmented language agents frequently fail in real-world deployment due to tool malfunctions (timeouts, API exceptions, or inconsistent outputs), which trigger cascading reasoning errors and task abandonment. Existing agent training pipelines optimize only for success trajectories and therefore fail to expose models to the tool failures that dominate real-world usage. We propose PALADIN, a generalizable framework for equipping language agents with robust failure recovery capabilities. PALADIN trains on 50,000+ recovery-annotated trajectories constructed via systematic failure injection and expert demonstrations on an enhanced ToolBench dataset. Training uses LoRA-based fine-tuning to retain base capabilities while injecting recovery competence. At inference, PALADIN detects execution-time errors and retrieves the most similar case from a curated bank of 55+ failure exemplars aligned with ToolScan's taxonomy, then executes the corresponding recovery action. This approach generalizes to novel failures beyond the training distribution, retaining 95.2% recovery performance on unseen tool APIs. Evaluation across PaladinEval and ToolReflectEval demonstrates consistent improvements in Recovery Rate (RR), Task Success Rate (TSR), Catastrophic Success Rate (CSR), and Efficiency Score (ES). PALADIN improves RR from 32.76% to 89.68% (+57 percentage points) over ToolBench and outperforms the strongest baseline, CRITIC (76.34%), by 13.3 percentage points. Against vanilla agents, PALADIN achieves 89.86% RR (+66 percentage points, up from 23.75%). These results establish PALADIN as an effective method for building fault-tolerant agents capable of robust recovery in real-world tool environments.
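
The inference-time step described in the abstract (detect an error, retrieve the most similar exemplar, apply its recovery action) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `Exemplar` fields, the four bank entries, the action names, and the use of token-level Jaccard similarity are hypothetical stand-ins, since the abstract does not specify the bank's contents or the similarity measure PALADIN uses.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Exemplar:
    category: str         # hypothetical taxonomy label
    error_text: str       # representative error message for this category
    recovery_action: str  # hypothetical name of the action to take


# Tiny illustrative bank; the real bank is described as having 55+ exemplars.
BANK = [
    Exemplar("timeout", "request timed out after 30 seconds", "retry_with_backoff"),
    Exemplar("rate_limit", "429 too many requests rate limit exceeded", "wait_then_retry"),
    Exemplar("invalid_argument", "400 bad request missing required parameter", "repair_arguments"),
    Exemplar("not_found", "404 resource not found", "switch_tool"),
]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two messages (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(error_message: str,
             bank: Sequence[Exemplar] = BANK,
             min_score: float = 0.1) -> Optional[Exemplar]:
    """Return the closest exemplar, or None if nothing is similar enough."""
    best = max(bank, key=lambda ex: jaccard(error_message, ex.error_text))
    if jaccard(error_message, best.error_text) < min_score:
        return None
    return best


match = retrieve("Connection timed out after 10 seconds")
print(match.recovery_action if match else "no match")
```

The `min_score` threshold is a design choice of this sketch: returning `None` for a poor match lets the caller fall back to a default strategy instead of applying an unrelated recovery action.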

Paper Structure

This paper contains 58 sections, 10 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Recovery Rate vs Task Success Rate comparison between CRITIC and PALADIN across different LLMs. Stars indicate PALADIN results, circles indicate CRITIC baseline. Dotted lines show improvement trajectories from CRITIC to PALADIN for each model. PALADIN consistently achieves higher task success rates while maintaining superior error recovery capabilities.
  • Figure 2: Our tool-use simulator with integrated error injection and recovery mechanisms. (a) depicts the static architecture, where errors are injected into tool calls and handled via a recovery dictionary. (b) details the dynamic execution loop, capturing assistant reasoning, recovery actions, and final outcomes. This design allows controlled, reproducible evaluation of LLM resilience to tool failures.
  • Figure 3: Trace repair pipeline for constructing the Error-Trajectory Dataset. Each ToolBench trace is truncated at the first failure, then repaired or finalized via GPT-guided recovery. Outputs are stored with recovery metadata to construct PALADIN’s training corpus.
  • Figure 4: PALADIN’s performance without inference-time error matching compared to the baseline across Gemma, Qwen, LLaMA, and AM-Thinking backbones. See Figure \ref{fig:Abalation-amthinking} for full ablation graphs and Figure \ref{fig:generalization} for full generalization graphs.
  • Figure 5: PALADIN's Thought Process
  • ...and 10 more figures
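
The simulator design summarized in the Figure 2 caption (errors injected into tool calls and handled via a recovery dictionary) can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' simulator: `ToolError`, `inject_failures`, `run_with_recovery`, the two error kinds, and the `RECOVERY` mapping are all hypothetical names chosen for the example.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


class ToolError(Exception):
    """Simulated tool failure carrying a failure kind (hypothetical)."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


def inject_failures(tool: Callable[[Any], Any], failure_rate: float,
                    kinds: Sequence[str], rng: random.Random) -> Callable[[Any], Any]:
    """Wrap a tool so that each call fails with probability failure_rate."""
    def wrapped(arg: Any) -> Any:
        if rng.random() < failure_rate:
            raise ToolError(rng.choice(list(kinds)))
        return tool(arg)
    return wrapped


# Hypothetical recovery dictionary: failure kind -> recovery action.
RECOVERY = {"timeout": "retry", "api_error": "fallback"}


def run_with_recovery(tool: Callable[[Any], Any], arg: Any,
                      fallback: Callable[[Any], Any],
                      max_attempts: int = 3) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Call the tool, consult RECOVERY on failure, and record a trace."""
    trace: List[Tuple[str, Any]] = []
    for _ in range(max_attempts):
        try:
            result = tool(arg)
            trace.append(("ok", result))
            return result, trace
        except ToolError as err:
            action = RECOVERY.get(err.kind, "abort")
            trace.append((err.kind, action))
            if action == "fallback":
                return fallback(arg), trace
            if action == "abort":
                break
            # action == "retry": loop again until attempts are exhausted
    return None, trace


rng = random.Random(0)  # seeded so runs are reproducible
flaky = inject_failures(lambda x: x * 2, 1.0, ["api_error"], rng)
print(run_with_recovery(flaky, 21, fallback=lambda x: -1))
```

Seeding the random generator mirrors the caption's emphasis on controlled, reproducible evaluation; the recorded trace is the kind of artifact that could be stored alongside recovery metadata.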