Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

Tanay Gondil

Abstract

Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study in which models first predict their refusal behavior and then respond in a fresh context. Across 3,754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but that sensitivity drops substantially at safety boundaries. We observe generational improvement within the Claude family (Sonnet 4.5: 95.7% accuracy vs. Sonnet 4: 93.0%), while GPT-5.2 shows lower accuracy (88.9%) with more variable behavior. Llama 405B achieves high sensitivity but exhibits a strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0%). Topic-wise analysis reveals that weapons-related queries are consistently the hardest for introspection. Critically, confidence scores provide an actionable signal: restricting to high-confidence predictions yields 98.3% accuracy for well-calibrated models, enabling practical confidence-based routing in safety-critical deployments.
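For reference, the sensitivity index reported above is the standard signal detection theory measure. Under the natural reading of this setup (our assumption, not stated explicitly in the abstract: a "hit" is a predicted refusal that actually occurs, and a "false alarm" is a predicted refusal followed by compliance), it is

$$ d' = \Phi^{-1}(\mathrm{HR}) - \Phi^{-1}(\mathrm{FAR}), $$

where $\Phi^{-1}$ is the inverse of the standard normal CDF, HR is the hit rate, and FAR is the false-alarm rate.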

Paper Structure

This paper contains 40 sections, 3 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Signal detection analysis. (a) Overall introspective sensitivity ($d'$). (b) Sensitivity drops substantially at safety boundaries (red shaded region), with GPT-5.2 showing the largest degradation.
  • Figure 2: Topic-wise introspection accuracy. Weapons queries are hardest across all models; Llama shows uniformly lower accuracy due to refusal bias.
  • Figure 3: Error analysis. (a) Errors peak at Level 4, not Level 3. (b) Claude and Llama show FP-dominant errors (refusal bias); GPT-5.2 is more balanced. (c) Errors by confidence level.
  • Figure 4: Confidence-based routing. (a) Accuracy-coverage trade-off: Claude models achieve near-perfect accuracy at high confidence, while Llama's poor calibration prevents effective routing; a minimal sketch of this trade-off follows the figure list. (b) High-confidence accuracy with 95% CI.
  • Figure 5: Boundary analysis. (a) Number of boundary cases per model: GPT-5.2 has the most, Llama has fewest (due to refusal bias). (b) Behavioral consistency across paraphrases.
  • ...and 2 more figures
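
To make the accuracy-coverage trade-off behind confidence-based routing (Figure 4) concrete, below is a minimal Python sketch. It is not the authors' code: the function name, threshold grid, and synthetic data are illustrative assumptions. The idea is to act only on predictions whose confidence clears a threshold, defer the rest, and report accuracy over the kept subset against the fraction kept.

    import numpy as np

    def accuracy_coverage_curve(confidence, correct, thresholds=None):
        # Accuracy and coverage when only predictions with confidence >= t are routed automatically.
        confidence = np.asarray(confidence, dtype=float)
        correct = np.asarray(correct, dtype=bool)
        if thresholds is None:
            thresholds = np.linspace(0.0, 1.0, 21)  # illustrative threshold grid (assumption)
        points = []
        for t in thresholds:
            kept = confidence >= t              # predictions we act on
            coverage = kept.mean()              # fraction of requests not deferred
            accuracy = correct[kept].mean() if kept.any() else float("nan")
            points.append((float(t), float(coverage), float(accuracy)))
        return points

    # Illustrative usage on synthetic data (not the paper's dataset):
    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=1000)
    corr = rng.random(1000) < conf              # higher confidence -> correct more often
    for t, cov, acc in accuracy_coverage_curve(conf, corr, thresholds=[0.6, 0.8, 0.9]):
        print(f"threshold={t:.2f}  coverage={cov:.2f}  accuracy={acc:.3f}")

Raising the threshold trades coverage for accuracy; the paper's finding that well-calibrated models reach 98.3% accuracy on high-confidence predictions corresponds to the high-threshold end of such a curve.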