How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity

Zihan Ma; Dongsheng Zhu; Shudong Liu; Taolin Zhang; Junnan Liu; Qingqiu Li; Minnan Luo; Songyang Zhang; Kai Chen

TL;DR

The paper studies how the safety of LLM-driven agents holds up when malicious intent is concealed within complex tasks. It introduces OASIS, a benchmark organized along two dimensions, Task Complexity and Intent Concealment, and paired with a high-fidelity simulation sandbox of 53 tools. OASIS provides per-step harm annotations and ground-truth toolchains for granular analysis of safety boundaries, and defines metrics such as Hierarchical Refusal Rate ($HRR$) and Harm Progression Score ($HPS$) to quantify safety performance. The findings show that safety alignment degrades sharply as concealment increases, and that a non-monotonic Complexity Paradox appears: on more complex tasks, agents can seem safer only because their planning limits obscure the harm. Many agents rely on static pre-execution refusals, though some (notably the GPT-5 family) implement dynamic, in-workflow refusals with significantly reduced harm progression. By releasing OASIS, its dataset, and the simulation environment, the work provides a principled framework for evaluating and strengthening safety in overlooked dimensions of agent autonomy and planning under uncertainty.

Abstract

Current safety evaluations for LLM-driven agents primarily focus on atomic harms, failing to address sophisticated threats where malicious intent is concealed or diluted within complex tasks. We address this gap with a two-dimensional analysis of agent safety brittleness under the orthogonal pressures of intent concealment and task complexity. To enable this, we introduce OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox. Our findings reveal two critical phenomena: safety alignment degrades sharply and predictably as intent becomes obscured, and a "Complexity Paradox" emerges, where agents seem safer on harder tasks only due to capability limitations. By releasing OASIS and its simulation environment, we provide a principled foundation for probing and strengthening agent safety in these overlooked dimensions.
