Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Umid Suleymanov; Zaur Rajabov; Emil Mirzazada; Murat Kantarcioglu

TL;DR

This work introduces SemSIEdit, an inference-time framework in which an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. It also identifies a Reasoning Paradox: inference-time reasoning raises baseline risk while also enabling safer rewrites.

Abstract

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
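The critique-and-rewrite loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `evaluate` and `edit` callables, the `max_iters` default, and the toy keyword-based stand-ins are all hypothetical placeholders for the LLM-backed Evaluator and Editor.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: in the paper these roles are played by LLM prompts.
Evaluator = Callable[[str], List[str]]    # returns the flagged sensitive spans
Editor = Callable[[str, List[str]], str]  # rewrites the draft given the flags


def self_correct(draft: str, evaluate: Evaluator, edit: Editor,
                 max_iters: int = 3) -> Tuple[str, int, bool]:
    """Critique and rewrite until nothing is flagged or the budget runs out.

    Returns (final_text, rewrites_used, converged).
    """
    text = draft
    for used in range(max_iters + 1):
        flagged = evaluate(text)
        if not flagged:
            return text, used, True   # used == 0 means the draft was already safe
        if used == max_iters:
            break                     # budget exhausted without converging
        text = edit(text, flagged)    # rewrite the flagged spans, keep the rest
    return text, max_iters, False


def toy_evaluate(text: str) -> List[str]:
    # Toy detector (assumption): flags one fixed phrase instead of calling an LLM judge.
    return [s for s in ("is probably diabetic",) if s in text]


def toy_edit(text: str, flagged: List[str]) -> str:
    # Toy editor (assumption): swaps each flagged span for neutral wording
    # instead of deleting the whole sentence.
    for span in flagged:
        text = text.replace(span, "has shared no health details")
    return text


final, rewrites, ok = self_correct(
    "Alex skips dessert, so Alex is probably diabetic.", toy_evaluate, toy_edit
)
```

The return value separates three outcomes: a draft that passes with zero rewrites, one that converges after some rewrites, and one that exhausts the iteration budget while still flagged.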

Paper Structure (25 sections, 8 figures, 7 tables, 1 algorithm)


Introduction
Related Work
SemSIEdit Framework
Experimental Setup
Empirical Results & Analysis
The Privacy-Utility Frontier
Scale-Dependent Safety Behaviors
The Double-Edged Sword of Reasoning
Epistemic Uncertainty & Hallucination
Robustness & Validation
Conclusion
Instruction Prompts
Preprocessing Prompt
Evaluator Prompt
Editor Prompt
...and 10 more sections

Figures (8)

Figure 1: Mitigation of Incorrect Hazardous Information. An example of SemSIEdit handling the "Incorrect Hazardous Information" category of SemSI as defined by zhang2025a.
Figure 2: Schematic overview of the SemSIEdit framework.
Figure 3: The Privacy-Utility Pareto Frontier. A scatter plot visualizing the efficiency of SemSIEdit. The y-axis represents the Privacy Gain (reduction in SemSI), while the x-axis represents the Utility Cost (reduction in response quality). Ideally, methods should appear in the Top-Left quadrant (High Gain, Low Cost). The data reveals a favorable exchange rate: on average, models achieve a 34.6% reduction in semantic leakage while sacrificing only 9.8% in utility.
Figure 4: Convergence Efficiency by Model Scale. Larger models (Left) frequently satisfy safety constraints during the initialization phase (Pink), requiring zero additional feedback loops. In contrast, smaller models (Right) often fail to converge, exhausting the maximum iteration budget (Blue) without resolving the semantic leakage.
Figure 5: The "Rewrite vs. Truncate" Divergence. Comparing answer lengths before (Blue) and after (Orange) defense. While most models achieve safety by aggressively truncating responses (significant negative $\Delta$), GPT-5 uniquely increases answer length ($+110$ chars), demonstrating its capacity to generate nuanced, safe explanations rather than refusing.
...and 3 more figures
