Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Hayfa Dhabhi, Kashyap Thimmaraju

TL;DR

This work reframes LLM safety as a sequential pipeline with four distinct checkpoints (CP1 through CP4), organized by processing stage (input vs. output) and detection level (literal vs. intent). It introduces the Four-Checkpoint Framework and 13 checkpoint-targeted evasion techniques, enabling controlled, single-turn, black-box evaluations of frontier models, scored with Weighted Attack Success Rate (WASR) to capture partial information leakage. Across GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, results reveal strong input-literal defenses but weak output-stage defenses: CP3 and CP4 show the highest bypass rates, while CP1 remains comparatively robust. The findings highlight the need to strengthen output-stage filtering and to adopt leakage-aware metrics beyond binary success, offering a structured approach for diagnosing and improving LLM safety in deployed systems.

Abstract

Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6% attack success. However, WASR reveals 52.7%, a 2.3× higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72–79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
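
To make the gap between Binary ASR and a severity-weighted rate concrete, here is a minimal Python sketch. The judge labels (`refusal`, `partial_leak`, `full_compliance`) and their weights are hypothetical, chosen only for illustration; the abstract does not state the paper's actual categories or weights, so this should not be read as the authors' exact WASR definition.

```python
# Hypothetical judge labels and severity weights (assumptions for illustration;
# the paper's actual categories and weights are not given in the abstract).
SEVERITY_WEIGHTS = {
    "refusal": 0.0,          # model declined; nothing leaked
    "partial_leak": 0.5,     # some harmful detail leaked
    "full_compliance": 1.0,  # model fully answered the harmful request
}


def binary_asr(labels):
    """Fraction of responses counted as a success under an all-or-nothing rule."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == "full_compliance") / len(labels)


def weighted_asr(labels, weights=SEVERITY_WEIGHTS):
    """Mean severity weight across responses, so partial leaks also count."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(weights[label] for label in labels) / len(labels)


# Toy data: 4 refusals, 4 partial leaks, 2 full compliances.
labels = ["refusal"] * 4 + ["partial_leak"] * 4 + ["full_compliance"] * 2

print(binary_asr(labels))    # 2 of 10 responses are full compliance
print(weighted_asr(labels))  # (4 * 0.0 + 4 * 0.5 + 2 * 1.0) / 10

# The ratio quoted in the abstract, recomputed from the reported figures.
print(round(52.7 / 22.6, 1))
```

In this toy data the binary rule ignores the four partial leaks entirely, while the weighted rate counts each at half severity, which is the kind of under-reporting the abstract attributes to binary evaluation.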

Paper Structure (117 sections, 3 equations, 13 figures, 16 tables)

This paper contains 117 sections, 3 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Comparison of model responses in Normal vs. Jailbreak modes.
  • Figure 2: Output filtering architectures. (A) Partial detection monitors tokens during generation and can early-stop harmful outputs. (B) Full detection evaluates complete responses post-generation. (C) Regeneration retries generation when harmful content is detected, up to N attempts.
  • Figure 3: The LLM Safety Pipeline. Safety mechanisms are organized along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). Each checkpoint represents a distinct defensive layer.
  • Figure 4: CP1 Leet Speak transformation. Characters are substituted with visually similar symbols (a→@, e→3, o→0).
  • Figure 5: CP2 Research Framing transformation. The harmful request is embedded within academic context with credibility markers.
  • ...and 8 more figures