SSR: Socratic Self-Refine for Large Language Model Reasoning

Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

TL;DR

SSR tackles test-time reasoning by decomposing a model's reasoning trace into verifiable Socratic steps and estimating per-step confidence through controlled re-solving and self-consistency checks. It then refines the weakest steps and optionally refines high-level plans (SSR-Plan) to improve robustness and efficiency. Empirical results on five reasoning tasks (three mathematical, two logical) across multiple LLMs show SSR outperforming state-of-the-art iterative self-refinement baselines, with SSR-Plan delivering the strongest gains on challenging problems like AIME. Beyond accuracy, SSR provides a black-box framework for interpreting and evaluating internal reasoning dynamics, supporting principled and controllable test-time reasoning.

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
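The step-level confidence idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `resolve` callable (standing in for an LLM call that re-solves one sub-question given the preceding steps), the exact-match agreement rule, and the names `step_confidence` and `weakest_step` are all assumptions made for illustration. The released code may compare answers differently.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (sub-question, sub-answer)


def _norm(text: str) -> str:
    # Hypothetical normalization; a real system may need semantic matching.
    return text.strip().lower()


def step_confidence(sub_answer: str, resolved: List[str]) -> float:
    """Fraction of independently re-solved answers that match the original."""
    if not resolved:
        return 0.0
    agree = sum(1 for r in resolved if _norm(r) == _norm(sub_answer))
    return agree / len(resolved)


def weakest_step(
    steps: List[Step],
    resolve: Callable[[List[Step], str], str],  # assumed LLM wrapper
    n_samples: int = 4,
) -> Optional[Tuple[int, float, str]]:
    """Return (index, confidence, majority re-solved answer) of the
    least-confident step, or None when there are no steps."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    worst: Optional[Tuple[int, float, str]] = None
    for i, (question, answer) in enumerate(steps):
        context = steps[:i]  # re-solve conditioned on the earlier steps only
        samples = [resolve(context, question) for _ in range(n_samples)]
        conf = step_confidence(answer, samples)
        majority = Counter(_norm(s) for s in samples).most_common(1)[0][0]
        if worst is None or conf < worst[1]:
            worst = (i, conf, majority)
    return worst
```

In this sketch the majority re-solved answer is returned alongside the weakest step so that a refinement call could use it as a targeted hint. That pairing is a design choice of the sketch, not something the abstract specifies.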

Paper Structure

This paper contains 26 sections, 12 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: Test-Time Parallel Scaling Performance (Left) and Conceptual Overview (Right) of our proposed Socratic Self-Refine (SSR). By decomposing responses into Socratic steps, re-evaluating intermediate results through self-consistency, and refining specific step-level errors, SSR achieves substantially higher initial accuracy ($\sim$67.57% relative improvement) and continues to scale effectively even when standard Chain-of-Thought (CoT) begins to saturate. Notably, this performance advantage holds under comparable computational cost. Experiments are conducted with GPT-5-mini in low-reasoning, low-verbosity mode.
  • Figure 2: Overview of Socratic Self-Refine (SSR). Block ①: Chain-of-Thought (CoT) reasoning, which serves as the starting point for the iterative refinement methods. Block ②: simple Self-Refine, which generates feedback and then refines the original response based on that feedback. Block ③: plan refinement, which summarizes the high-level plan of a reasoning trace and refines the plan and the trace if necessary. Blocks ④-⑥: the three building blocks of SSR, namely Socratic decomposition, Socratic verification, and Socratic refinement. SSR-Lin (Linear SSR) applies the three blocks (④-⑥) for $K$ iterations. SSR-Ada (Adaptive SSR) carries out the Socratic blocks (④-⑥) only when plain Self-Refine cannot identify any mistakes (${\bm{c}}={\bm{c}}_{\max}$). SSR-Plan (Adaptive SSR with Plan Refinement) adds a plan refinement round (③) before the full iterative refinement algorithm (④-⑥).
  • Figure 3: Performance of Sequential (Left) and Parallel (Right) Test-Time Scaling, evaluated on AIME25 [aime] with GPT-5-mini in low-reasoning, low-verbosity mode.
  • Figure 4: Performance of our SSR with explicit control of granularity, evaluated on AIME24 and AIME25 [aime] with GPT-5-mini in low-reasoning, low-verbosity mode. The marker size of each data point is proportional to the support size.
  • Figure 5: Performance of Parallel Test-Time Scaling, evaluated on AIME25 with GPT-5-mini in low-reasoning, low-verbosity mode. Iterative refinement (both Self-Refine [madaan2023self] and our SSR) holds a non-trivial advantage over CoT [wei2022chain] in absolute performance and budget control. Our SSR outperforms the baselines under the same budget, with SSR's confidence estimation playing a crucial role.
  • ...and 11 more figures
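
The adaptive control flow described in the Figure 2 caption (SSR-Ada) can be sketched as follows. This is a hedged illustration only: the callables `self_refine_feedback`, `refine`, and `socratic_round`, the integer confidence scale, and the fixed iteration count are assumptions, and the paper's algorithm may add stopping rules or other details not visible in the caption.

```python
from typing import Callable, Tuple


def ssr_ada(
    response: str,
    self_refine_feedback: Callable[[str], Tuple[int, str]],  # assumed: (confidence, feedback)
    refine: Callable[[str, str], str],                       # assumed: plain Self-Refine step
    socratic_round: Callable[[str], str],                    # assumed: blocks 4-6 in one call
    c_max: int = 5,
    num_iters: int = 3,
) -> str:
    """Run plain Self-Refine while it reports a problem; fall back to the
    Socratic blocks only when it reports maximal confidence (c == c_max)."""
    for _ in range(num_iters):
        conf, feedback = self_refine_feedback(response)
        if conf < c_max:
            # Self-Refine found something to fix: apply its coarse feedback.
            response = refine(response, feedback)
        else:
            # Self-Refine sees no mistake: decompose, verify, refine per step.
            response = socratic_round(response)
    return response
```

Under the same assumptions, an SSR-Lin variant would call `socratic_round` unconditionally on every iteration, and an SSR-Plan variant would run one plan refinement call before entering this loop.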