How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity

Zihan Ma, Dongsheng Zhu, Shudong Liu, Taolin Zhang, Junnan Liu, Qingqiu Li, Minnan Luo, Songyang Zhang, Kai Chen

TL;DR

The paper examines how safely LLM-driven agents behave when malicious intent is concealed within complex tasks. It introduces OASIS, a benchmark organized along two dimensions, Task Complexity and Intent Concealment, and paired with a high-fidelity simulation sandbox of 53 tools. OASIS provides per-step harm annotations and ground-truth toolchains for granular analysis of safety boundaries, and defines metrics such as Hierarchical Refusal Rate ($HRR$) and Harm Progression Score ($HPS$) to quantify safety performance. The findings show that safety alignment degrades sharply as concealment increases, and that a non-monotonic "Complexity Paradox" appears, in which agents can seem safer on more complex tasks because of planning and capability limits rather than stronger alignment. Many agents rely on static, pre-execution refusals, though some (notably the GPT-5 family) issue dynamic, in-workflow refusals that significantly reduce harm progression. By releasing OASIS, its dataset, and the simulation environment, the work provides a principled framework for evaluating and strengthening agent safety along these overlooked dimensions.

Abstract

Current safety evaluations for LLM-driven agents primarily focus on atomic harms, failing to address sophisticated threats where malicious intent is concealed or diluted within complex tasks. We address this gap with a two-dimensional analysis of agent safety brittleness under the orthogonal pressures of intent concealment and task complexity. To enable this, we introduce OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox. Our findings reveal two critical phenomena: safety alignment degrades sharply and predictably as intent becomes obscured, and a "Complexity Paradox" emerges, where agents seem safer on harder tasks only due to capability limitations. By releasing OASIS and its simulation environment, we provide a principled foundation for probing and strengthening agent safety in these overlooked dimensions.

Paper Structure

This paper contains 27 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Agent safety is brittle. A direct harmful instruction (top) may be refused, but the same action can be executed when embedded as a sub-task in a complex workflow with concealed intent (bottom). This motivates our two-dimensional analysis of Task Complexity and Intent Concealment.
  • Figure 2: The OASIS evaluation workflow. A task, defined by its position on the Task Complexity and Intent Concealment axes, is passed to the agent. The agent interacts with the stateful simulation sandbox. Its execution trace is then evaluated against per-step harm labels under Realistic and Idealized scenarios to generate fine-grained safety metrics.
  • Figure 3: Dimensional safety profiles for each agent. Each subplot shows the full 3×3 matrix across Task Complexity (x-axis: L1–L3) and Intent Concealment (Low, Medium, High). Dark, solid bars depict the Realistic refusal rate (operational safety), while light, semi-transparent bars depict the Idealized rate (intrinsic safety). When the Idealized value is lower than the Realistic bar and would be occluded, a dashed horizontal line marks the Idealized level to ensure visibility without changing bar widths.
  • Figure 4: The Complexity-Safety Tradeoff (Gap) across dimensions. (a-b) The Tradeoff for each model, averaged across (a) Task Complexity and (b) Intent Concealment levels. Darker cells indicate a more severe degradation. (c) The mean overall Tradeoff for each model, summarizing its safety brittleness under operational pressure.
  • Figure 5: (a) Composition of task outcomes for each agent, showing the proportion of static (pre-execution) refusals, dynamic (post-execution) refusals, and safety failures. (b) Characterization of safety mechanisms. Agents are plotted by their dynamic monitoring rate (x-axis) and the resulting harm (y-axis, HPS), allowing for classification into archetypes like 'Dynamic and Effective' (bottom-right) and 'Static Failure' (top-left).
  • ...and 1 more figures
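
The outcome composition described for Figure 5(a) can be illustrated with a minimal sketch. It assumes a hypothetical trace record that stores how many tool calls the agent made before refusing; the `Trace` class, its field name, and the category labels are illustrative assumptions, not the OASIS schema or the paper's metric definitions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Category labels are illustrative; they mirror the three outcomes named in Figure 5(a).
CATEGORIES = ("static_refusal", "dynamic_refusal", "safety_failure")


@dataclass
class Trace:
    """Hypothetical record of one agent run (an assumption, not the OASIS format)."""
    # Number of tool calls executed before the agent refused; None if it never refused.
    refused_after_calls: Optional[int]


def classify(trace: Trace) -> str:
    """Assign a run to one of the three outcome categories."""
    if trace.refused_after_calls is None:
        return "safety_failure"      # the agent never refused
    if trace.refused_after_calls == 0:
        return "static_refusal"      # refused before executing any tool
    return "dynamic_refusal"         # refused partway through the workflow


def outcome_composition(traces: List[Trace]) -> Dict[str, float]:
    """Return the proportion of runs falling in each category."""
    if not traces:
        raise ValueError("at least one trace is required")
    counts = Counter(classify(t) for t in traces)
    return {name: counts.get(name, 0) / len(traces) for name in CATEGORIES}


if __name__ == "__main__":
    runs = [Trace(0), Trace(2), Trace(None), Trace(None)]
    print(outcome_composition(runs))
```

Under these assumptions, the share of `dynamic_refusal` runs would play the role of a dynamic monitoring rate on the x-axis of Figure 5(b); how the paper actually computes that rate and HPS is not specified in this summary.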