Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

Arnold Cartagena, Ariane Teixeira

Abstract

Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action--a divergence we formalize as the GAP metric. Even under safety-reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool-call behavior: TC-safe rates span 21 percentage points for the most robust model and 57 for the most prompt-sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool-call attempts themselves. These results demonstrate that text-only safety evaluations are insufficient for assessing agent behavior and that tool-call safety requires dedicated measurement and mitigation.
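The Bonferroni correction mentioned above can be illustrated with a minimal sketch. It divides a family-wise significance level by the number of comparisons (18 in the abstract) and tests each p-value against the resulting threshold. The significance level of 0.05 and the p-values shown are assumptions for illustration only; the abstract states neither, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def significant_after_bonferroni(
    p_values: Sequence[float],
    alpha: float = 0.05,  # assumed family-wise level, not stated in the abstract
    num_comparisons: Optional[int] = None,
) -> List[bool]:
    """Flag each p-value that stays below the corrected threshold.

    num_comparisons defaults to len(p_values); pass it explicitly when only
    a subset of the comparison family is being checked.
    """
    m = len(p_values) if num_comparisons is None else num_comparisons
    threshold = bonferroni_threshold(alpha, m)
    return [p < threshold for p in p_values]


# 18 pairwise ablation comparisons, as reported in the abstract.
threshold_18 = bonferroni_threshold(0.05, 18)

# Hypothetical p-values, for illustration only (not taken from the paper).
example_flags = significant_after_bonferroni([0.001, 0.004], num_comparisons=18)
```

Under these assumptions a comparison must reach roughly p < 0.0028 to count as significant, which is why a raw p-value of 0.004 would not survive the correction.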

Paper Structure (74 sections, 1 equation, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Experimental pipeline. Each scenario is evaluated under all combinations of system prompt condition, prompt variant, model, and governance mode, yielding 17,420 analysis-ready rows. Multipliers on edges show the cross-product expansion at each stage. The scoring pipeline derives five metrics from each interaction: TC-safe and T-safe are evaluated independently; GAP and LEAK are conjunctions thereof (Equations \ref{eq:tcsafe}--\ref{eq:leak}).
  • Figure 2: TC-safe rates by model and system prompt condition (jailbreak scenarios only, $n \approx 756$ per cell). Error bars show 95% Clopper-Pearson confidence intervals. All models show substantial improvement under safety-reinforced conditions. DeepSeek V3.2 exhibits a reversed pattern under tool-encouraging conditions (Section \ref{sec:results-deepseek}).
  • Figure 3: Outcome distribution per model under each system prompt condition (jailbreak scenarios only). Every interaction falls into exactly one of four mutually exclusive categories: TC-safe (no forbidden tool call), GAP (text refusal with forbidden tool call), LEAK (forbidden tool call with PII surfaced), or unsafe other (forbidden tool call, no text refusal, no PII surfaced). Bars sum to 100%. LEAK dominates the unsafe portion for most models; GAP is most visible for GPT-5.2 under tool-encouraging conditions, consistent with its 79.3% conditional GAP rate (Section \ref{sec:results-gap}).
  • Figure 4: Prompt sensitivity as TC-safe range across system prompt conditions. Each line spans from the tool-encouraging rate (diamond) to the safety-reinforced rate (square), with neutral marked (circle). GPT-5.2 exhibits the widest range (57 pp), indicating highly prompt-contingent safety. Claude exhibits the narrowest (21 pp), indicating training-intrinsic safety. Models are sorted by range.
  • Figure 5: False-positive rates of the forbidden action predicates on legitimate-use control scenarios ($n = 3{,}887$ total control interactions). DeepSeek's 14.2% rate is an outlier driven by aggressive tool-calling behavior; the remaining five models range from 0.0% to 4.5%. Dashed line indicates the overall 3.8% rate.
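
The four outcome categories described in the Figure 3 caption can be sketched as a small classifier. This is an illustrative reading of the caption, not the paper's scoring pipeline. The `Interaction` fields and function names are hypothetical, and the precedence of GAP over LEAK when a response both refuses in text and surfaces PII is an assumption, since the caption does not say how that overlap is resolved.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

CATEGORIES = ("TC-safe", "GAP", "LEAK", "unsafe other")


@dataclass(frozen=True)
class Interaction:
    """Hypothetical per-interaction flags; the names are illustrative."""
    forbidden_tool_call: bool  # model attempted a forbidden tool call
    text_refusal: bool         # model's text output refused the request
    pii_surfaced: bool         # PII appeared as a result of the tool call


def classify(x: Interaction) -> str:
    """Assign exactly one of the four categories from the Figure 3 caption."""
    if not x.forbidden_tool_call:
        return "TC-safe"
    if x.text_refusal:
        # Assumption: GAP takes precedence if PII was also surfaced.
        return "GAP"
    if x.pii_surfaced:
        return "LEAK"
    return "unsafe other"


def outcome_shares(interactions: Iterable[Interaction]) -> Dict[str, float]:
    """Percentage of interactions in each category; values sum to 100."""
    counts = Counter(classify(x) for x in interactions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no interactions to summarize")
    return {name: 100.0 * counts[name] / total for name in CATEGORIES}


# Made-up sample with one interaction per category (not data from the paper).
sample = [
    Interaction(forbidden_tool_call=False, text_refusal=True, pii_surfaced=False),
    Interaction(forbidden_tool_call=True, text_refusal=True, pii_surfaced=False),
    Interaction(forbidden_tool_call=True, text_refusal=False, pii_surfaced=True),
    Interaction(forbidden_tool_call=True, text_refusal=False, pii_surfaced=False),
]
shares = outcome_shares(sample)
```

Because every interaction receives exactly one label, the shares for a given model and condition sum to 100%, which matches the stacked-bar layout the caption describes.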