SSR: Socratic Self-Refine for Large Language Model Reasoning
Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz
TL;DR
SSR tackles test-time reasoning by decomposing a model's reasoning trace into verifiable Socratic steps and estimating per-step confidence through controlled re-solving and self-consistency checks. It then refines the least reliable steps and can optionally refine the high-level plan (SSR-Plan) to improve robustness and efficiency. Empirical results on five reasoning tasks (three mathematical, two logical) across three LLMs show SSR outperforming state-of-the-art iterative self-refinement baselines, with SSR-Plan delivering the strongest gains on challenging problems such as AIME. Beyond accuracy, SSR provides a black-box framework for interpreting and evaluating internal reasoning dynamics, supporting principled and controllable test-time reasoning.
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
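To make the step-level idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reasoning trace has already been decomposed into (sub-question, sub-answer) pairs and that a hypothetical `solve(context, sub_question, sample_index)` callable stands in for an LLM call that re-solves one sub-question given the preceding steps. Confidence is taken here to be the fraction of re-solved answers that agree with the original sub-answer, and refinement is simplified to swapping in the majority re-solved answer. All names (`Step`, `step_confidence`, `refine_weakest_step`, `toy_solve`) are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (sub_question, sub_answer)
Solver = Callable[[List[Step], str, int], str]  # hypothetical LLM re-solve call


def step_confidence(step: Step, context: List[Step], solve: Solver,
                    n_samples: int = 5) -> Tuple[float, str]:
    """Re-solve one sub-question n_samples times given only the earlier steps.

    Returns (fraction of samples agreeing with the original sub-answer,
    majority re-solved answer).
    """
    sub_q, sub_a = step
    answers = [solve(context, sub_q, i).strip() for i in range(n_samples)]
    agree = sum(a == sub_a.strip() for a in answers)
    majority = Counter(answers).most_common(1)[0][0]
    return agree / n_samples, majority


def refine_weakest_step(steps: List[Step], solve: Solver, n_samples: int = 5,
                        threshold: float = 1.0) -> Tuple[List[Step], Optional[int]]:
    """Score every step, then replace the least confident sub-answer.

    Simplification (assumption): the weakest sub-answer is replaced by the
    majority re-solved answer. Steps after it are not regenerated here.
    """
    scored = []
    for idx, step in enumerate(steps):
        conf, majority = step_confidence(step, steps[:idx], solve, n_samples)
        scored.append((conf, idx, majority))
    conf, idx, majority = min(scored)  # ties resolve to the earliest step
    if conf >= threshold:
        return list(steps), None  # every step is confident enough
    refined = list(steps)
    refined[idx] = (steps[idx][0], majority)
    return refined, idx


def toy_solve(context: List[Step], sub_question: str, sample_index: int) -> str:
    """Deterministic stand-in for an LLM, used only for this demo."""
    table = {"What is 3 + 4?": "7", "What is 7 * 2?": "14"}
    return table[sub_question]


steps = [("What is 3 + 4?", "7"), ("What is 7 * 2?", "15")]
refined, fixed_index = refine_weakest_step(steps, toy_solve)
```

In this toy run the second step disagrees with every re-solved sample, so it is the one selected and corrected. A fuller loop would regenerate the steps that depend on the corrected one and repeat until every step clears the threshold or an iteration budget runs out.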
