Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique
Yansi Li, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Qiuzhi Liu, Rui Wang, Zhuosheng Zhang, Zhaopeng Tu, Haitao Mi, Dong Yu
TL;DR
PANEL is a non-scalar, natural-language self-critique mechanism for stepwise inference-time reasoning in LLMs. It replaces external verifiers with self-generated critiques that guide a three-stage search (sampling, critique, decision). The approach provides richer justification for each candidate step and selects steps dynamically based on the critiques, improving performance on challenging STEM benchmarks (AIME, GPQA) across model sizes. Empirical results show that PANEL outperforms both solution-level and step-level self-evaluation baselines, with natural-language (NL) self-critique driving notable gains and early-step influence. The work suggests a scalable path to more robust, qualitative reasoning in diverse domains without task-specific verifiers, though future work should deepen the theoretical understanding and explore broader applicability.
Abstract
Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deductions, remains a significant challenge. Traditional inference-time scaling methods use scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach, stepwise natural language self-critique (PANEL), which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and the associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at https://github.com/puddingyeah/PANEL to support and encourage future research in this promising field.
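The step-level search described above (sample candidate steps, critique each in natural language, then decide which to keep) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stepwise_critique_search`, the three callables `sample`, `critique`, and `decide`, the `is_final` stopping check, and the toy stand-ins are all hypothetical. In practice each callable would presumably wrap a prompt to the same LLM.

```python
from typing import Callable, List

# Hypothetical interfaces; the source does not specify them.
Sampler = Callable[[str, List[str], int], List[str]]             # -> candidate next steps
Critic = Callable[[str, List[str], str], str]                    # -> natural-language critique
Decider = Callable[[str, List[str], List[str], List[str]], int]  # -> index of the chosen candidate


def stepwise_critique_search(
    question: str,
    sample: Sampler,
    critique: Critic,
    decide: Decider,
    n_candidates: int = 3,
    max_steps: int = 8,
    is_final: Callable[[str], bool] = lambda step: step.startswith("ANSWER"),
) -> List[str]:
    """Build a reasoning chain one step at a time, guided by text critiques."""
    steps: List[str] = []
    for _ in range(max_steps):
        # Stage 1 (sampling): propose several candidate next steps.
        candidates = sample(question, steps, n_candidates)
        if not candidates:
            break
        # Stage 2 (critique): write a natural-language critique per candidate.
        critiques = [critique(question, steps, c) for c in candidates]
        # Stage 3 (decision): pick a candidate from the critiques, not from scalar scores.
        chosen = candidates[decide(question, steps, candidates, critiques)]
        steps.append(chosen)
        if is_final(chosen):
            break
    return steps


# Toy stand-ins so the sketch runs without a model (purely illustrative).
def toy_sample(question: str, steps: List[str], n: int) -> List[str]:
    k = len(steps)
    if k >= 2:
        return ["ANSWER: 4"]
    return [f"step {k} option {i}" for i in range(n)]


def toy_critique(question: str, steps: List[str], candidate: str) -> str:
    ok = candidate.endswith("option 1") or candidate.startswith("ANSWER")
    return "looks fine" if ok else "has a flaw"


def toy_decide(question: str, steps: List[str],
               candidates: List[str], critiques: List[str]) -> int:
    # Choose the first candidate whose critique reports no flaw.
    for i, text in enumerate(critiques):
        if "flaw" not in text:
            return i
    return 0
```

Calling `stepwise_critique_search("What is 2 + 2?", toy_sample, toy_critique, toy_decide)` with the toy stand-ins walks through three rounds and stops at the step beginning with `ANSWER`. The point of the design is that the decision stage reads the critique text itself, so no separately trained verifier or scalar reward model appears anywhere in the loop.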
