Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique

Yansi Li, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Qiuzhi Liu, Rui Wang, Zhuosheng Zhang, Zhaopeng Tu, Haitao Mi, Dong Yu

TL;DR

PANEL is an inference-time reasoning method for LLMs that replaces the scalar scores of external verifiers with self-generated natural-language critiques. Each reasoning step passes through a three-stage search: sampling candidate steps, critiquing each candidate, and deciding which one to keep. The critiques give a richer justification for each candidate, and the decision stage selects steps dynamically based on critique-informed reasoning. PANEL improves performance on challenging STEM benchmarks (AIME, GPQA) across model sizes and outperforms both solution-level and step-level self-evaluation baselines; the natural-language (NL) self-critique drives notable gains, with a marked influence on early-step decisions. The work suggests a scalable path to more robust, qualitative reasoning across diverse domains without task-specific verifiers, though future work should deepen theoretical understanding and explore broader applicability.

Abstract

Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deductions, remains a significant challenge. Traditional inference-time scaling methods use scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach, stepwise natural language self-critique (PANEL), which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and the associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at https://github.com/puddingyeah/PANEL to support and encourage future research in this promising field.
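
The step-level search described above (sample candidate steps, critique each in natural language, then decide) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `stepwise_search`, `sample`, `critique`, `decide`, and `is_final`, the candidate count, and the toy stand-ins for model calls are all hypothetical. In practice each callable would wrap a prompt to the same LLM.

```python
from typing import Callable, List

# Hypothetical callable signatures (assumptions, not taken from the paper's code).
Sampler = Callable[[str, List[str], int], List[str]]             # candidates for the next step
Critic = Callable[[str, List[str], str], str]                    # natural-language critique
Decider = Callable[[str, List[str], List[str], List[str]], int]  # index of the chosen candidate


def default_is_final(step: str) -> bool:
    """Assumed stopping rule: a step that states the answer ends the search."""
    return step.startswith("ANSWER")


def stepwise_search(
    question: str,
    sample: Sampler,
    critique: Critic,
    decide: Decider,
    n_candidates: int = 4,
    max_steps: int = 8,
    is_final: Callable[[str], bool] = default_is_final,
) -> List[str]:
    """Build a reasoning chain one step at a time, guided by text critiques."""
    steps: List[str] = []
    for _ in range(max_steps):
        # Stage 1: sample several candidate next steps.
        candidates = sample(question, steps, n_candidates)
        if not candidates:
            break
        # Stage 2: write a natural-language critique for every candidate.
        critiques = [critique(question, steps, c) for c in candidates]
        # Stage 3: pick one candidate after reading all critiques together.
        chosen = decide(question, steps, candidates, critiques)
        steps.append(candidates[chosen])
        if is_final(steps[-1]):
            break
    return steps


# Toy stand-ins for LLM calls, used only to make the sketch runnable.
def toy_sample(question: str, steps: List[str], n: int) -> List[str]:
    if len(steps) >= 2:
        return ["ANSWER: 4"] * n
    return [f"step {len(steps)} option {i}" for i in range(n)]


def toy_critique(question: str, steps: List[str], candidate: str) -> str:
    ok = candidate.endswith("option 1") or candidate.startswith("ANSWER")
    return "looks fine" if ok else "has a flaw"


def toy_decide(question: str, steps: List[str],
               candidates: List[str], critiques: List[str]) -> int:
    for i, text in enumerate(critiques):
        if text == "looks fine":
            return i
    return 0  # fall back to the first candidate if every critique is negative


if __name__ == "__main__":
    print(stepwise_search("What is 2 + 2?", toy_sample, toy_critique, toy_decide))
```

The decision stage receives the critiques as text instead of a ranked list of scores, which is the point of contrast with scalar process-reward selection; how the critiques are actually prompted and consumed is defined by the paper's released code, not by this sketch.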

Paper Structure

This paper contains 28 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: An illustration of our proposed PANEL, a novel inference-time scaling framework that incorporates rich natural language self-critique to guide step-level search in reasoning tasks, moving beyond traditional scalar correctness scores. Rather than relying solely on scalar outputs from task-specific verifiers, PANEL employs natural language feedback to offer nuanced insights into each reasoning step's strengths and weaknesses. Furthermore, PANEL dynamically selects the best candidates, representing a fundamental departure from conventional verifier-based strategies that always choose the candidates with the highest PRM score.
  • Figure 2: A case study from AIME25 where PANEL produces correct results while step-level self-evaluation fails.
  • Figure 3: Pass@k accuracies of our PANEL and the baseline model. For reference, we also provide the results of PANEL without NL Self-Critique (i.e., "Step-Level Self-Evaluation" in the main results table).
  • Figure 4: Impact of NL self-critique on decision making at each reasoning step. The "divergence ratio" denotes the proportion of decisions that differ when using NL self-critique versus not using it.
  • Figure 5: Prompt of NL self-critique for math reasoning task.
  • ...and 4 more figures
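
The "divergence ratio" in the Figure 4 caption is a plain proportion, and a small sketch makes the definition concrete. It assumes that, for one reasoning step, the chosen candidate index is recorded for each problem both with and without NL self-critique; the function name and data layout are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def divergence_ratio(with_critique: Sequence[int],
                     without_critique: Sequence[int]) -> float:
    """Share of decisions that differ between the two settings."""
    if len(with_critique) != len(without_critique):
        raise ValueError("both settings must cover the same decisions")
    if not with_critique:
        return 0.0  # assumed convention for an empty comparison
    differing = sum(a != b for a, b in zip(with_critique, without_critique))
    return differing / len(with_critique)


if __name__ == "__main__":
    # Hypothetical chosen-candidate indices for four problems at one step.
    print(divergence_ratio([0, 1, 2, 1], [0, 2, 2, 0]))
```

Computing this ratio separately for each step index yields a per-step curve of the kind the caption describes.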