Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong, Jindong Li, Feifei Zhao, Yi Zeng

TL;DR

This work identifies response entropy as an intrinsic safety signal in aligned LLMs, where low-entropy refusals indicate safe behavior and high-entropy responses correlate with unsafe content. It proposes Safety Instincts Reinforcement Learning (SIRL), which uses negative entropy as an internal reward to self-improve safety without human annotations or reward models. Through extensive experiments on multiple models and attack types, SIRL achieves Defense Success Rates above 89% (often >98%) while preserving or improving mathematics, coding, and conversational abilities, using only 15,000 unlabeled prompts. The results demonstrate that robust AI safety can emerge from within the model by reinforcing its own safety instincts, offering a scalable approach to defense against evolving jailbreak threats.

Abstract

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal--models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (SIRL), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. SIRL teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, SIRL maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to adaptive attacks. Using only 15,000 unlabeled prompts, SIRL surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
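The idea of using negative entropy as a self-generated reward can be illustrated with a minimal sketch. It assumes the reward is the negated mean per-token Shannon entropy of a response. The exact formulation used in SIRL (normalization, RL algorithm, KL regularization) is not given in this excerpt, so the function names and the toy distributions below are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def response_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over all generated tokens of a response."""
    if not token_dists:
        raise ValueError("response has no tokens")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def negative_entropy_reward(token_dists: Sequence[Sequence[float]]) -> float:
    """Assumed reward: lower entropy (higher confidence) gives a higher reward."""
    return -response_entropy(token_dists)


# Hypothetical toy distributions over a 4-token vocabulary, 3 tokens each.
confident_refusal = [[0.97, 0.01, 0.01, 0.01]] * 3    # peaked: low entropy
uncertain_compliance = [[0.25, 0.25, 0.25, 0.25]] * 3  # uniform: high entropy

r_refuse = negative_entropy_reward(confident_refusal)
r_comply = negative_entropy_reward(uncertain_compliance)
print(f"refusal reward: {r_refuse:.3f}, compliance reward: {r_comply:.3f}")
```

Under this assumption, the confident refusal receives the higher reward, so a policy-gradient update would push the model toward it. No labels or external reward model enter the computation.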

Paper Structure

This paper contains 46 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Entropy reveals intrinsic safety signals. (a) SIRL teaches models to trust low-entropy refusals over uncertain compliance. (b) Entropy distributions for safe vs. unsafe outputs under jailbreak attacks.
  • Figure 2: Token-level entropy reveals safety confidence patterns. (a) Entropy across token positions: safe responses maintain low entropy, unsafe ones show high variability. (b) Entropy by token function: Risk Articulation $<$ General $<$ Compliance Signals. (c) Example: lottery scam response showing per-token entropy differences.
  • Figure 3: DSR heatmaps across diverse jailbreak attacks.
  • Figure 4: DSRs (%) against adaptive attacks.
  • Figure 5: Effect of KL divergence coefficient $\beta$.
  • ...and 6 more figures