Self-Reflective Generation at Test Time

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu

TL;DR

SRGen tackles the brittleness of forward-only LLM reasoning by introducing a proactive, test-time self-reflection framework. It senses high-uncertainty tokens through a dynamic entropy threshold and applies a brief, transient correction δ to the hidden state, optimizing a hybrid objective that balances fidelity to prior tokens with reducing future uncertainty. Theoretical grounding via a constrained-optimization perspective explains the loss blend and the role of the trade-off parameter, while experiments on math benchmarks across diverse models show consistent gains in single-pass accuracy and improved self-consistency, with bounded overhead. The method is plug-and-play, training-free, and complementary to other test-time techniques like SLOT, making it a practical approach for more reliable and efficient LLM reasoning in real-world settings.

Abstract

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen consistently strengthens model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. In particular, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
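To make the corrective-vector step concrete, here is a minimal pure-Python sketch of how such a vector could be fitted at one uncertain position. It is a toy illustration under stated assumptions, not the authors' implementation: the language-model head is replaced by a small hypothetical matrix `W`, the weighting `(1 - lam) * CE + lam * entropy` is assumed, and the names `hybrid_loss_and_grad` and `optimize_delta`, the learning rate, and the step count are all invented for this example.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, h, delta):
    # Toy LM head (assumption): logits = W @ (h + delta)
    x = [hi + di for hi, di in zip(h, delta)]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def hybrid_loss_and_grad(W, prior, h_cur, delta, lam):
    """Return (loss, gradient w.r.t. delta).

    prior: (hidden_state, observed_next_token) pairs from the generated context.
    Loss (assumed form): (1 - lam) * mean cross-entropy on prior tokens
                         + lam * entropy of the current prediction.
    """
    grad = [0.0] * len(delta)
    ce = 0.0
    for h, y in prior:
        p = softmax(logits(W, h, delta))
        ce += -math.log(p[y]) / len(prior)
        for j, row in enumerate(W):
            # d(CE)/d(logit_j) = p_j - 1[j == y]
            gz = (1 - lam) * (p[j] - (1.0 if j == y else 0.0)) / len(prior)
            for i, w in enumerate(row):
                grad[i] += gz * w
    p = softmax(logits(W, h_cur, delta))
    ent = -sum(q * math.log(q) for q in p if q > 0)
    for j, row in enumerate(W):
        if p[j] > 0:
            # d(entropy)/d(logit_j) = -p_j * (log p_j + entropy)
            gz = -lam * p[j] * (math.log(p[j]) + ent)
            for i, w in enumerate(row):
                grad[i] += gz * w
    return (1 - lam) * ce + lam * ent, grad

def optimize_delta(W, prior, h_cur, lam=0.5, lr=0.1, steps=20):
    # delta starts at zero and is discarded after the token is emitted.
    delta = [0.0] * len(h_cur)
    for _ in range(steps):
        _, g = hybrid_loss_and_grad(W, prior, h_cur, delta, lam)
        delta = [d - lr * gi for d, gi in zip(delta, g)]
    return delta

# Hypothetical toy data: 3-token vocabulary, 2-dimensional hidden states.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
prior = [([1.0, 0.2], 0), ([0.1, 1.0], 1)]
h_cur = [0.3, 0.25]

zero = [0.0, 0.0]
loss_before, _ = hybrid_loss_and_grad(W, prior, h_cur, zero, 0.5)
delta = optimize_delta(W, prior, h_cur)
loss_after, _ = hybrid_loss_and_grad(W, prior, h_cur, delta, 0.5)
```

In this sketch `delta` is a single vector shared across the prior positions and the current one, and it is thrown away once the token is chosen, which mirrors the transient correction the TL;DR describes. How the real method applies the vector inside a transformer is not specified here.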

Paper Structure

This paper contains 30 sections, 1 theorem, 30 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Given a trade-off parameter $\lambda \in (0,1)$, the minimizer $\delta^*$ of the hybrid loss objective is also the solution to a constrained optimization problem: minimize the entropy term subject to the fidelity constraint $\mathcal{L}_{\text{CE}}(\delta) \le \epsilon$. The choice of $\lambda$ implicitly defines the constraint boundary $\epsilon = \mathcal{L}_{\text{CE}}(\delta^*)$, establishing a formal equivalence between tuning the loss weight and setting a fidelity tolerance.
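The equations of the theorem did not survive on this page, so the following is only a sketch of why such an equivalence holds. It rests on two assumptions: the hybrid loss has the weighted form $(1-\lambda)\,\mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{H}}$, and $\mathcal{L}_{\text{H}}$ is used here as a placeholder symbol for the entropy-minimization term. Neither the weighting nor the symbol is confirmed by the text above.

$$
\delta^* \in \arg\min_{\delta}\; (1-\lambda)\,\mathcal{L}_{\text{CE}}(\delta) + \lambda\,\mathcal{L}_{\text{H}}(\delta),
\qquad
\epsilon := \mathcal{L}_{\text{CE}}(\delta^*).
$$

Suppose some $\delta'$ satisfies $\mathcal{L}_{\text{CE}}(\delta') \le \epsilon$ and $\mathcal{L}_{\text{H}}(\delta') < \mathcal{L}_{\text{H}}(\delta^*)$. Since $1-\lambda > 0$ and $\lambda > 0$,

$$
(1-\lambda)\,\mathcal{L}_{\text{CE}}(\delta') + \lambda\,\mathcal{L}_{\text{H}}(\delta')
< (1-\lambda)\,\mathcal{L}_{\text{CE}}(\delta^*) + \lambda\,\mathcal{L}_{\text{H}}(\delta^*),
$$

which contradicts the minimality of $\delta^*$. Hence, under these assumptions,

$$
\delta^* \in \arg\min_{\delta}\; \mathcal{L}_{\text{H}}(\delta)
\quad \text{subject to} \quad \mathcal{L}_{\text{CE}}(\delta) \le \epsilon .
$$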

Figures (9)

  • Figure 1: An overview of the Self-Reflective Generation (SRGen) framework. This framework consists of two main stages. (1) Uncertainty Monitoring. A threshold is dynamically computed from the mean and standard deviation of token entropies within a recent history window of size N. (2) Self-Reflective Optimization. If the current token's entropy exceeds the threshold, a correction vector, $\delta$, is optimized on-the-fly using a joint loss of cross-entropy and entropy minimization. This $\delta$ is then added to the token's hidden state to steer the final decision towards a more reliable outcome.
  • Figure 2: Activations and Time Increase.
  • Figure 3: Cons@k and Pass@k accuracy of Qwen2.5-Math-7B on the AMC benchmark.
  • Figure 4: Performance of SLOT, SRGen, and their combination for Qwen2.5-Math-7B across the AMC, MATH500, and AIME2024 benchmarks.
  • Figure 5: Ablation analysis of the balancing parameter $\lambda$, window size $N$, and standard-deviation multiplier $k$.
  • ...and 4 more figures
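The uncertainty-monitoring stage in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the threshold is the window mean plus $k$ times the window standard deviation, with $N$ and $k$ taken from the ablation caption. The class name `EntropyMonitor`, the use of the population standard deviation, and the choice to wait until the window is full before triggering are all assumptions made for this example.

```python
import math
from collections import deque
from statistics import mean, pstdev

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

class EntropyMonitor:
    """Flags tokens whose entropy exceeds mean + k * std of recent entropies."""

    def __init__(self, window=4, k=2.0):
        self.history = deque(maxlen=window)  # last N entropies
        self.k = k

    def should_reflect(self, entropy):
        trigger = False
        # Assumption: only start triggering once the window is full.
        if len(self.history) == self.history.maxlen:
            threshold = mean(self.history) + self.k * pstdev(self.history)
            trigger = entropy > threshold
        # Assumption: every entropy, triggered or not, enters the history.
        self.history.append(entropy)
        return trigger

monitor = EntropyMonitor(window=3, k=1.0)
flags = [monitor.should_reflect(e) for e in [0.1, 0.1, 0.1, 0.5, 0.1]]
```

With these toy values only the fourth entropy is flagged: it is the first one evaluated against a full window, and it sits above the window mean while the window's standard deviation is zero.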

Theorems & Definitions (1)

  • Theorem 1: Hybrid Loss as Principled Constrained Optimization