Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan, Victor Li, Qi Lei

TL;DR

The paper tackles safety in large language models by introducing Progressive Self-Reflection (PSR), an inference-time mechanism that interleaves generation with self-evaluation so the model can backtrack from unsafe outputs, all without modifying model weights. PSR uses a lightweight adaptive predictor to schedule reflection rounds based on input risk, achieving substantial reductions in jailbreak attack success across open-source models while largely preserving performance on benign tasks. It treats safety as a test-time scaling problem, trading additional inference time for lower attack success rates, and can outperform external guardrails when given enough compute. The approach is training-free and offers a scalable path toward safer LLM deployment in real-world settings.

Abstract

Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method reduces the attack success rate from 77.5% to 5.9% on Llama-3.1-8B-Instruct, from 89.7% to 5.6% on Llama-3.1-8B base, and from 44.4% to 3.8% on Qwen2.5-7B-Instruct, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input's risk profile.
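
The generate, reflect, and backtrack loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `progressive_self_reflection`, `scripted_model`, `keyword_judge`, and `REFUSAL` are hypothetical, the "model" is a fixed token script, the self-evaluation is a keyword check standing in for the LLM judging its own partial output, and backtracking is simplified to returning a fixed refusal instead of regenerating a safer response.

```python
from typing import Callable, List

REFUSAL = "I can't help with that."  # hypothetical safe fallback

# Toy interfaces (assumptions): both callables receive (prompt, generated).
NextToken = Callable[[List[str], List[str]], str]
HarmScore = Callable[[List[str], List[str]], float]


def progressive_self_reflection(
    next_token: NextToken,
    harm_score: HarmScore,
    prompt: List[str],
    max_new_tokens: int = 32,
    reflect_every: int = 4,
    max_rounds: int = 2,
    threshold: float = 0.5,
) -> str:
    """Generate token by token, pausing periodically to check the partial output."""
    generated: List[str] = []
    rounds_used = 0
    for _ in range(max_new_tokens):
        token = next_token(prompt, generated)
        if token == "<eos>":
            break
        generated.append(token)
        # A reflection round is due every `reflect_every` tokens,
        # as long as the round budget has not been spent.
        due = len(generated) % reflect_every == 0
        if due and rounds_used < max_rounds:
            rounds_used += 1
            if harm_score(prompt, generated) > threshold:
                # Backtrack: discard the partial output (simplified to a refusal).
                return REFUSAL
    return " ".join(generated)


def scripted_model(script: List[str]) -> NextToken:
    """Stand-in for an LLM: replays a fixed token sequence, then emits <eos>."""
    def next_token(prompt: List[str], generated: List[str]) -> str:
        i = len(generated)
        return script[i] if i < len(script) else "<eos>"
    return next_token


def keyword_judge(prompt: List[str], generated: List[str]) -> float:
    """Stand-in for self-evaluation: flags output containing a marker word."""
    return 1.0 if "steal" in generated else 0.0
```

In this toy setup, a scripted output whose flagged word appears at the sixth token is caught with `reflect_every=2, max_rounds=3` but passes through with `max_rounds=2`, since the round budget is exhausted before the third check. That mirrors, in miniature, the trade-off the abstract describes: more reflection rounds cost more inference work and catch more unsafe continuations.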

Paper Structure

This paper contains 28 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of our proposed method. Given a potentially harmful user prompt (top-left), the LLM (bottom-left) generates an initial response "Sure, here is a guide for stealing from a store without getting caught" and begins to generate unsafe content, denoted in red. Before the harmful response is completed, a self-reflection prompt is injected (e.g., "Let's check if the generated text is harmful or harmless"), allowing the model to assess its own output. If the response is deemed harmful, the model backtracks and regenerates a safer alternative. Otherwise, the LLM continues generating without being affected by the probing tokens.
  • Figure 2: Kernel density estimates (KDEs) of the normalized harmful probability, computed as $\frac{p_\theta\left(w_{\text{harm}} \mid \operatorname{Prompt}\left(x_{1:t}\right)\right)}{p_\theta\left(w_{\text{safe}} \mid \operatorname{Prompt}\left(x_{1:t}\right)\right) + p_\theta\left(w_{\text{harm}} \mid \operatorname{Prompt}\left(x_{1:t}\right)\right)}$, across various evaluation datasets. Each subplot corresponds to a different language model variant: (a) Llama-3.1-8B-Instruct, (b) Qwen2.5-7B-Instruct, and (c) Llama-3.1-8B (base). Datasets include adversarial, jailbreak, and safety-specific benchmarks (e.g., AdvBench, JailbreakBench, HexPHI), as well as non-adversarial tasks (e.g., GSM8K, SAMSUM) for contrast. Sharp peaks near zero correspond to non-harmful generations, while wider or shifted distributions indicate model uncertainty or increased likelihood of harmful content.
  • Figure 3: Attack success rate (ASR) under the AdvBench prefilling attack and inference time on the benign SamSum dataset for Llama-3.1-8B-Instruct (blue) and Qwen2.5-7B-Instruct (orange) under varying numbers of self-reflection rounds (n). As n increases, the models exhibit a substantial drop in ASR, indicating greater robustness to adversarial prompts, at the cost of a notable rise in inference time.
  • Figure 4: Hyperparameter sensitivity of PSR. All results are averaged over five random seeds. These runtime measurements were obtained on a different hardware configuration and using only the first 100 SAMSUM samples; as a result, they may not exactly match the values in Figure 3.
  • Figure 5: t-SNE of the last generated token by dataset. Different markers denote the dataset (e.g., SamSum, AdvBench, SimpleSafetyTests), while the color scale indicates the number of self-reflection rounds (from 0 to 4).
  • ...and 1 more figures
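
The normalized harmful probability in the Figure 2 caption is the probability of the "harm" verdict token divided by the combined probability of the "harm" and "safe" verdict tokens. A small sketch of that computation follows, assuming both tokens are scored under the same next-token softmax; in that case the softmax denominator cancels and the ratio equals a sigmoid of the logit gap. The function names are hypothetical and chosen only for illustration.

```python
import math


def harm_ratio_from_probs(p_harm: float, p_safe: float) -> float:
    """Compute p_harm / (p_safe + p_harm) from the two verdict-token probabilities."""
    total = p_harm + p_safe
    if total <= 0.0:
        raise ValueError("verdict-token probabilities must sum to a positive value")
    return p_harm / total


def harm_ratio_from_logits(logit_harm: float, logit_safe: float) -> float:
    """Same ratio computed from raw logits.

    With p_w = exp(z_w) / Z for a shared normalizer Z, the ratio reduces to
    sigmoid(z_harm - z_safe), so the rest of the vocabulary is not needed.
    """
    gap = logit_harm - logit_safe
    # Branch on the sign so that exp() never receives a large positive argument.
    if gap >= 0.0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)
```

Because only the two verdict tokens enter the ratio, the score stays in [0, 1] even when the model places most of its probability mass on unrelated tokens, which is what makes the distributions comparable across datasets.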