When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan

TL;DR

The paper addresses a gap in safety evaluation: a model may accept one phrasing of a harmless request yet refuse a near-identical paraphrase. It names this failure mode semantic confusion, introduces ParaGuard, a curated 10k-prompt corpus of controlled paraphrase clusters, and proposes three token-level metrics, Confusion Index (CI), Confusion Rate (CR), and Confusion Depth (CD), to diagnose neighborhood-level inconsistencies. Evaluating diverse models and guards, the work shows that global false-rejection metrics obscure structured patterns, including globally unstable boundaries and localized pockets of confusion, and that confusion-aware auditing can separate how often a system refuses from how sensibly it refuses. The results give practitioners actionable signals for reducing false refusals while maintaining safety, using neighborhood-aware, token-level diagnostics that apply across guard architectures.

Abstract

Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.

Paper Structure

This paper contains 31 sections, 12 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Top: ParaGuard construction from OR-Bench, USEBench, and PHTest; each seed yields five intent-preserving variants via lexical/register edits, keyword softening or hardening, and controlled rewrites. Candidates pass three gates (sentence similarity, nontrivial rewrite, safety risk), producing a $\sim$10,000-prompt corpus with similarity, overlap, and risk annotations. Bottom: Confusion measurement. For each rejected prompt $r$, retrieve its top-$k$ accepted neighbors with FAISS and compute pairwise token-level drift, probability shift, and perplexity delta. Averaging over neighbors gives $CI(r)$; aggregating over rejections yields CI (mean), CR@$\,\tau$ (share above threshold), and CD (spread).
  • Figure 2: t-SNE projection of token embeddings colored by token-level confusion $\mathrm{CI}_{\text{tok}}$. Bright regions correspond to dense, highly confusable neighborhoods; darker regions indicate more isolated, semantically distinctive tokens.
  • Figure 3: Prompt- vs. token-level confusion. Each point is a rejected prompt: the $x$-axis shows prompt-level cosine similarity to accepted neighbors, and the $y$-axis shows our token-level confusion score. The strong vertical spread indicates that token-level confusion is only weakly related to prompt-level similarity.
  • Figure 4: Token-level confusion within prompt-similarity bands. Prompts are grouped by prompt-level cosine similarity, and we plot the distribution of token-level confusion in each bin. Wide, non-collapsing violins—even for $[0.9,1.0]$—show that near-identical prompt embeddings can still yield very different confusion scores.
  • Figure 5: Where semantic confusion concentrates. Top: FRR (left) and confusion rate (right) over low/high risk and seed-similarity bins, showing that FRR mainly tracks risk while confusion spikes in the low-risk, high-similarity region. Bottom: in the high-similarity slice, FRR stays flat across lexical-overlap bins, but the confusion rate among rejections rises sharply with overlap, indicating that refusals are most inconsistent for paraphrases that closely reuse the seed's wording.
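
The measurement pipeline in the Figure 1 caption (retrieve accepted neighbors, score each pair, average per rejection, then aggregate) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It rests on several assumptions: brute-force cosine search stands in for FAISS retrieval, the three pairwise signals are combined by an unweighted mean (the caption does not give the paper's actual combination), and the "spread" behind CD is taken to be a population standard deviation. All function names here are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(query, accepted, k):
    """Indices of the k accepted embeddings most similar to the query.

    Brute-force stand-in for the FAISS retrieval named in the caption.
    """
    ranked = sorted(
        range(len(accepted)),
        key=lambda i: cosine(query, accepted[i]),
        reverse=True,
    )
    return ranked[:k]


def pair_score(drift, prob_shift, ppl_delta):
    """ASSUMPTION: unweighted mean of the three pairwise signals."""
    return (drift + prob_shift + ppl_delta) / 3.0


def confusion_index(pair_scores):
    """CI(r): average pairwise score over one rejection's neighbors."""
    return mean(pair_scores)


def aggregate(ci_values, tau):
    """Corpus-level summaries over all rejected prompts."""
    return {
        "CI": mean(ci_values),                                    # mean CI(r)
        "CR": sum(c > tau for c in ci_values) / len(ci_values),   # share above tau
        "CD": pstdev(ci_values),                                  # ASSUMPTION: spread = std. dev.
    }
```

For instance, `aggregate([0.2, 0.4, 0.6, 0.8], tau=0.5)` reports a mean CI of 0.5 and a CR of 0.5, since two of the four rejections exceed the threshold. Keeping CI, CR, and CD separate mirrors the caption's distinction between average confusion, how many rejections are confused, and how unevenly confusion is distributed.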