Quantifying Distributional Robustness of Agentic Tool-Selection

Jehyeok Yeon, Isha Chaudhary, Gagandeep Singh

TL;DR

ToolCert formalizes tool selection in agentic LLMs as a Bernoulli-trial robustness problem against an adaptive adversary that iteratively injects deceptive tools. By simulating multi-round interactions and using Clopper-Pearson bounds, it provides a high-confidence lower bound on robust accuracy, revealing severe fragility of both retrieval and selection stages under adaptive attacks. The framework demonstrates that even with strong retrievers, the selector remains highly susceptible to metadata-driven manipulation, necessitating robustness certification for safe deployment. Empirical results across multiple state-of-the-art models show substantial drops in certified robustness under adversarial amplification, highlighting critical security risks in open tool ecosystems and guiding future defense research.
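
For reference, the textbook form of a one-sided Clopper-Pearson lower bound can be written as follows. The symbols $n$ (number of sampled interactions), $k$ (number of successes), $\alpha$ (one minus the confidence level), and $\underline{p}$ are notation assumed here for illustration, not taken from the paper; if the paper uses the lower end of a two-sided interval instead, $\alpha$ would be replaced by $\alpha/2$.

$$
\underline{p}(k, n, \alpha) =
\begin{cases}
0, & k = 0, \\
\text{the } p \in (0, 1] \text{ solving } \displaystyle\sum_{i=k}^{n} \binom{n}{i} p^{i} (1-p)^{n-i} = \alpha, & k \ge 1.
\end{cases}
$$

Equivalently, for $k \ge 1$, $\underline{p}$ is the $\alpha$-quantile of a $\mathrm{Beta}(k,\, n-k+1)$ distribution. As a quick check, when every trial succeeds ($k = n$) the equation reduces to $p^{n} = \alpha$, so $\underline{p} = \alpha^{1/n}$; with $n = 10$ and $\alpha = 0.05$ this gives $\underline{p} \approx 0.741$.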

Abstract

Large language models (LLMs) are increasingly deployed in agentic systems where they map user intents to relevant external tools to fulfill a task. A critical step in this process is tool selection, where a retriever first surfaces candidate tools from a larger pool, after which the LLM selects the most appropriate one. This pipeline presents an underexplored attack surface where errors in selection can lead to severe outcomes like unauthorized data access or denial of service, all without modifying the agent's model or code. While existing evaluations measure task performance in benign settings, they overlook the specific vulnerabilities of the tool selection mechanism under adversarial conditions. To address this gap, we introduce ToolCert, the first statistical framework that formally certifies tool selection robustness. ToolCert models tool selection as a Bernoulli success process and evaluates it against a strong, adaptive attacker who introduces adversarial tools with misleading metadata that are iteratively refined based on the agent's previous choices. By sampling these adversarial interactions, ToolCert produces a high-confidence lower bound on accuracy, formally quantifying the agent's worst-case performance. Our evaluation with ToolCert uncovers severe fragility: under attacks that inject deceptive tools or saturate retrieval, the certified accuracy bound drops to near zero, an average performance drop of over 60% compared to non-adversarial settings. For attacks targeting the retrieval and selection stages, the certified accuracy bound plummets to less than 20% after just a single round of adversarial adaptation. ToolCert thus reveals previously unexamined security threats inherent to tool selection and provides a principled method to quantify an agent's robustness to such threats, a necessary step for the safe deployment of agentic systems.
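
To make the "Bernoulli success process" framing concrete, the following is a minimal sketch of how a high-confidence lower bound on accuracy could be computed from sampled pass/fail outcomes. It is not ToolCert's implementation: the function names, the one-sided 95% level, and the synthetic outcomes are all assumptions for illustration, and the bound is found by simple bisection on the exact binomial tail.

```python
import math
import random


def binom_tail_ge(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def clopper_pearson_lower(k: int, n: int, alpha: float = 0.05, tol: float = 1e-10) -> float:
    """One-sided Clopper-Pearson lower bound on a Bernoulli success probability.

    k: observed successes, n: trials, alpha: 1 - confidence (assumed one-sided).
    The direct summation is fine for a few hundred trials; for much larger n
    a Beta-quantile routine such as scipy.stats.beta.ppf(alpha, k, n - k + 1)
    avoids overflow in the binomial coefficients.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    if k == 0:
        return 0.0  # no successes observed: the bound is trivially zero
    lo, hi = 0.0, 1.0
    # The tail P(X >= k) increases with p, so bisect for the p where it equals alpha.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid  # tail still below alpha: the bound lies above mid
        else:
            hi = mid
    return lo  # return the lower end so rounding errs on the conservative side


if __name__ == "__main__":
    # Hypothetical outcomes: True means the agent picked the correct tool
    # in one sampled adversarial interaction. Synthetic data, not paper results.
    rng = random.Random(0)
    outcomes = [rng.random() < 0.3 for _ in range(200)]
    k, n = sum(outcomes), len(outcomes)
    print(f"empirical accuracy: {k / n:.3f}")
    print(f"95% lower bound:    {clopper_pearson_lower(k, n):.3f}")
```

The lower bound is always at or below the empirical success rate, and the gap narrows as more interactions are sampled, which is why a certificate of this kind is reported alongside the number of trials and the confidence level.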

Paper Structure

This paper contains 52 sections, 6 equations, 8 figures, 5 tables, 4 algorithms.

Figures (8)

  • Figure 1: Attack surfaces in the tool-selection pipeline. (i) Unregulated tool pools, where anyone can publish tools with misleading or unsafe metadata; (ii) Retriever dependence, where only a top-$N$ slate of candidates is surfaced to the agent, making semantic similarity an exploitable weakness; and (iii) Metadata-driven selection, where the agent must parse natural-language descriptions to decide which tool to invoke, exposing it to manipulation and prompt injection.
  • Figure 2: Certified Robustness of LLM Agents Against Primary Attack Families. Each panel compares the clean accuracy (blue) against the certified lower bound on robust accuracy (orange) for the four defender models, all evaluated against the same representative strong attacker model. The certified bound shown is the 95% Clopper-Pearson lower bound on the success probability under a multi-round ($R=10$) attack.
  • Figure 3: Causal ablation isolating retrieval vs. selection effects.
  • Figure 4: Certified robustness under Adversarial Selection attacks. The plot shows a catastrophic and uniform collapse in robust accuracy across all 16 unique attacker-defender pairs, indicating a critical vulnerability.
  • Figure 5: Certified robustness under Top-$N$ Saturation attacks. Similar to adversarial selection, this attack is highly effective, causing a near-total failure in robust accuracy across almost all model pairings.
  • ...and 3 more figures