AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Ruipeng Wang; Yuxin Chen; Yukai Wang; Chang Wu; Junfeng Fang; Xiaodong Cai; Qi Gu; Hui Su; An Zhang; Xiang Wang; Xunliang Cai; Tat-Seng Chua

TL;DR

This work addresses the gap between idealized benchmark performance and real-world robustness of tool-using LLM agents by introducing AgentNoiseBench, a framework that systematically injects realistic user and tool perturbations while preserving task solvability. It combines a data-driven noise taxonomy with a constrained adversarial noise injection pipeline and a trajectory-aware evaluation protocol to measure robustness across diverse models and tasks. Key findings show widespread performance degradation under noise, with tool-noise generally more disruptive than user-noise, and reveal that strong reasoning capability does not guarantee robustness due to noise-induced spurious reasoning. The framework and findings have practical impact by guiding noise-aware training and evaluation strategies to build more trustworthy agents in imperfect environments.

Abstract

Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.
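To make the two noise types concrete, the following is a minimal illustrative sketch, not the paper's pipeline. It assumes a toy setting in which user-noise is modeled as typos in the instruction and tool-noise as transient tool failures. The names `add_user_noise` and `with_tool_noise`, the parameters, and the error payload are all hypothetical. Solvability is preserved only in a crude sense: tokens carrying task-critical values (anything non-alphabetic) are never altered, and the number of injected tool failures is bounded so that a retry eventually succeeds.

```python
import random
from typing import Any, Callable, Dict


def add_user_noise(instruction: str, rng: random.Random, rate: float = 0.3) -> str:
    """Hypothetical user-noise: swap two adjacent inner letters in some words.

    Only purely alphabetic words longer than three characters are touched, so
    tokens with digits or symbols (IDs, dates, amounts) stay intact. This is a
    crude stand-in for a solvability constraint, not the paper's method.
    """
    noisy = []
    for word in instruction.split(" "):
        if word.isalpha() and len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # keep first and last letters
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)


def with_tool_noise(
    tool: Callable[..., Any],
    rng: random.Random,
    fail_prob: float = 0.5,
    max_failures: int = 2,
) -> Callable[..., Dict[str, Any]]:
    """Hypothetical tool-noise: wrap a tool so it sometimes fails transiently.

    At most `max_failures` errors are injected in total, so an agent that
    retries is guaranteed to reach the real tool output eventually.
    """
    failures = 0

    def noisy_tool(**kwargs: Any) -> Dict[str, Any]:
        nonlocal failures
        if failures < max_failures and rng.random() < fail_prob:
            failures += 1
            return {"status": "error", "message": "temporary failure, please retry"}
        return {"status": "ok", "result": tool(**kwargs)}

    return noisy_tool


def lookup_order(order_id: str) -> str:
    """Toy tool used only for this illustration."""
    return f"order {order_id} is shipped"
```

Under these assumptions, an evaluation would run the same task once with the clean instruction and tool and once with the noisy versions, then compare the outcomes and the trajectories.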

Paper Structure (27 sections, 4 equations, 7 figures, 11 tables)

This paper contains 27 sections, 4 equations, 7 figures, 11 tables.

Introduction
AgentNoiseBench: A Systematic Framework for Robustness Evaluation
Design Principles
Benchmark Construction
Empirical Taxonomy of Noise
User Noise ($\mathcal{N}_{\text{user}}$):
Tool Noise ($\mathcal{N}_{\text{tool}}$):
Constrained Adversarial Noise Injection
Trajectory-Aware Evaluation Protocol
Experiments
Experimental Setup
Noise Robustness Remains an Open Challenge for Current Agents
Misalignment Exists between Reasoning Ability and Robustness
How Noise Disrupts Reasoning via Entropy Injection
Significant Performance Disparities Induced by Noise from Different Sources and Granularities
...and 12 more sections

Figures (7)

Figure 1: Overall scores on AgentNoiseBench, sorted by model performance under noisy conditions.
Figure 2: The framework of AgentNoiseBench. (A) Empirical Noise Taxonomy categorizes real-world noise into instruction noise and tool-execution noise. (B) Constrained Adversarial Noise Evolution & Injection mechanism applies controlled noise while ensuring task solvability. (C) Trajectory-Aware Evaluation Protocol assesses agent behavior and robustness through multi-dimensional metrics.
Figure 3: Relative deviation metrics, including Accuracy (ACC) and Inference Steps (STEP), for Think and Non-Think models across diverse scenarios. For brevity, the prefixes "V-", "T-", and "S-" stand for VitaBench, $\tau^{2}$-Bench, and Search, respectively.
Figure 4: Panels (a) and (b) illustrate the entropy at each reasoning step for thinking and non-thinking models, respectively, under noise-free, user-noise, and tool-noise settings.
Figure 5: The impact of different noises on agent performance. (a) Average impact of nine fine-grained noise categories (user-side "U-" and tool-side "T-") across all scenarios. (b) Performance degradation caused by user noise vs. tool noise in non-reasoning and reasoning-enabled models.
...and 2 more figures
