
The Verifier Tax: Horizon Dependent Safety Success Tradeoffs in Tool Using LLM Agents

Tanmay Sah, Vishal Srivastava, Dolly Sah, Kayden Jordan

Abstract

We study how runtime enforcement against unsafe actions affects end-to-end task performance in multi-step, tool-using large language model (LLM) agents. Using tau-bench across the Airline and Retail domains, we compare baseline Tool-Calling, planning-integrated (TRIAD), and policy-mediated (TRIAD-SAFETY) architectures with GPT-OSS-20B and GLM-4-9B. We identify model-dependent interaction horizons (15 to 30 turns) and decompose outcomes into overall success rate (SR), safe success rate (SSR), and unsafe success rate (USR). Our results reveal a persistent Safety Capability Gap: although safety mediation can intercept up to 94 percent of non-compliant actions, interception rarely translates into strictly safe goal attainment (SSR below 5 percent in most settings). We find that high unsafe success rates are primarily driven by Integrity Leaks, in which models hallucinate user identifiers to bypass mandatory authentication. Recovery rates following blocked actions are consistently low, ranging from 21 percent for GPT-OSS-20B on simpler procedural tasks to near zero in complex Retail scenarios. These results show that runtime enforcement imposes a significant verifier tax on conversation length and compute cost without guaranteeing safe completion, and they highlight the need for agents capable of grounded identity verification and post-intervention reasoning.
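The SR / SSR / USR decomposition named in the abstract can be illustrated with a minimal sketch. It assumes, without confirmation from the paper's text shown here, that each episode records whether the task succeeded and how many policy-violating actions were executed, that all three rates share the total episode count as denominator, and that SR = SSR + USR. The `Episode` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical per-task record; field names are illustrative, not the paper's."""
    success: bool    # task goal reached (e.g. benchmark reward of 1.0)
    violations: int  # number of policy-violating actions that were executed


def decompose(episodes):
    """Return (SR, SSR, USR) as fractions of all episodes.

    Assumption: a success is "safe" only if no violating action was executed,
    so SR splits exactly into SSR + USR.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    safe = sum(1 for e in episodes if e.success and e.violations == 0)
    unsafe = sum(1 for e in episodes if e.success and e.violations > 0)
    return (safe + unsafe) / n, safe / n, unsafe / n


# Toy data, not results from the paper.
episodes = [
    Episode(success=True, violations=0),   # safe success
    Episode(success=True, violations=2),   # unsafe success
    Episode(success=False, violations=1),  # failure
    Episode(success=False, violations=0),  # failure
]
sr, ssr, usr = decompose(episodes)
print(sr, ssr, usr)
```

Under this reading, a high SR paired with a low SSR means that most successes passed through at least one policy violation, which is the pattern the abstract calls the Safety Capability Gap.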

Paper Structure (63 sections, 6 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: Annotated Triad-Safety trajectory with a successful recovery (Task 41, GPT-OSS-20B, Airline, from native-triad-safety-airline_gpt-oss-20b_airline_..._192419.jsonl). Red rows indicate verifier rejections under rule [CANCELLATION_POLICY]; green rows indicate successful recovery actions. The agent first verifies the user's identity (Turn 3) and looks up the booking (Turn 6). It then mistakenly begins processing a disallowed refund based on the user's claim, but the verifier blocks the action (Turn 8). The agent recovers by explaining the policy (Turn 10). When the user disputes the booking time, the agent correctly adheres to its retrieved data (Turn 14) and escalates safely, achieving a verified task reward of 1.0.
  • Figure 2: Tool-Calling Architecture (Baseline). The agent directly executes tool calls without internal mediation.
  • Figure 3: Triad / Triad-Safety Architecture. Triad-Safety introduces a verifier gate and block-and-revise loop with explicit policy mediation.
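
The verifier gate and block-and-revise loop described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function `run_gated_step`, its parameters, the revision budget, and the toy policy are all hypothetical. The demo loosely mirrors the Figure 1 trajectory, where a refund is blocked under [CANCELLATION_POLICY] and the agent recovers by explaining the policy.

```python
def run_gated_step(propose, verify, execute, max_revisions=2):
    """Propose an action, gate it through a verifier, and revise on rejection.

    propose(feedback) -> action       (feedback is None on the first attempt)
    verify(action)    -> None if allowed, else the name of the violated rule
    execute(action)   -> result of actually running the tool call
    Returns (result, blocked_count); result is None if every attempt was blocked.
    """
    feedback = None
    blocked = 0
    for _ in range(max_revisions + 1):
        action = propose(feedback)
        violation = verify(action)
        if violation is None:
            return execute(action), blocked  # only verified actions are executed
        blocked += 1
        feedback = violation  # pass the rejection reason back to the agent
    return None, blocked


# Toy agent and policy (hypothetical names throughout).
def propose(feedback):
    # First attempt tries the refund; after a rejection, switch to explaining.
    return "issue_refund" if feedback is None else "explain_policy"


def verify(action):
    return "CANCELLATION_POLICY" if action == "issue_refund" else None


def execute(action):
    return f"executed:{action}"


result, blocked = run_gated_step(propose, verify, execute)
print(result, blocked)
```

Each blocked attempt costs an extra propose-and-verify round, which is one way to picture the added conversation length and compute that the abstract refers to as the verifier tax.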