Internal Safety Collapse in Frontier Large Language Models

Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang

Abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: https://github.com/wuyoscar/ISC-Bench

Paper Structure (49 sections, 1 equation, 11 figures, 24 tables)

Figures (11)

  • Figure 1: Internal safety collapse during anomaly detection. A user asks a frontier LLM to build a text anomaly detector, a task that requires labeled examples of both normal and harmful text. Left: The model provides technical guidance, then generates toxic examples (red text) as training data. Right: The user requests longer multilingual examples and additional sensitive categories; the model reasons this is "a legitimate research purpose" and complies. No adversarial prompt is used here; the model complies because the task cannot succeed without harmful examples.
  • Figure 2: ISC-Bench includes 53 scenarios across 8 disciplines, each encoding a legitimate workflow that involves sensitive data (e.g., toxin structures for molecular docking, exploit payloads for security validation, and pathogen sequences for epidemiological modeling). When Claude, GPT, Gemini, and Grok perform these tasks, they comply without adversarial prompting, and none of the evaluated LLMs issue proactive refusals.
  • Figure 3: ISC vs. Jailbreak Attacks. Jailbreak attacks (Row 1) circumvent safety alignment by adversarially reformulating a harmful query. ISC shows that, when a legitimate task inherently requires harmful data for correct completion, frontier LLMs generate it spontaneously (Row 2), albeit inconsistently. TVD (Row 3) renders this phenomenon systematic and reproducible by encoding task requirements as domain-specific constraints (e.g., validator assertions, schema checks, and format specifications). All examples shown are real outputs produced by Claude, GPT, Gemini, and Grok.
  • Figure 4: Anchor and trigger in TVD. The anchor (pre-filled fields in $\mathcal{D}$) steers what the LLM generates; the trigger (domain tool validation error) initiates generation. The same mechanism operates across domain tools with different data schemas.
  • Figure 5: Cross-domain verification. Fraction of ISC-Bench scenarios (53 total across 8 disciplines) in which the verification model generated domain-specific sensitive data, judged by GPT-5.2. Lighter bars denote rates below 100%. All five models produce sensitive data across every discipline; most variation appears in Llama 4 Maverick.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Definition 3.1: Internal Safety Collapse
  • Definition 3.2: TVD Framework