Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin

TL;DR

This work addresses the gap between reflective reasoning and actionable self-improvement in slow-thinking LLMs by introducing Double-Checker, a framework that couples direct inference with a curated self-critique and refinement loop. By training on 1,730 critique-refine instances, the method enables iterative self-correction during inference, significantly boosting performance on challenging math benchmarks (e.g., AIME pass@1) and demonstrating strong generalization to multidisciplinary tasks like GPQA. Key contributions include identifying that the “aha moment” alone is insufficient for self-improvement, detailing a four-step training process (Initial Generation, Critique, Refinement, Distillation), and validating substantial accuracy gains across multiple benchmarks with manageable computational costs. The approach offers a practical path toward more trustworthy and capable LLMs by formalizing self-critique as a learnable, iterative process embedded directly in the model’s inference loop.

Abstract

While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated set of 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our code and data are available at https://github.com/XinXU-USTC/DoubleChecker
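
The inference procedure described in the abstract (critique the current solution, refine it, and stop once the model judges its own solution correct) can be illustrated with a minimal Python sketch. This is not the authors' released implementation: the function `double_check`, the three callables `generate`, `critique`, and `refine`, and the `max_rounds` cap are all hypothetical names introduced here for illustration, and the toy stubs stand in for real model calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper's code):
#   generate(question)                    -> initial solution summary
#   critique(question, summary)           -> (critique text, judged-correct flag)
#   refine(question, summary, critique)   -> revised solution summary
Generate = Callable[[str], str]
Critique = Callable[[str, str], Tuple[str, bool]]
Refine = Callable[[str, str, str], str]


def double_check(
    question: str,
    generate: Generate,
    critique: Critique,
    refine: Refine,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Critique and refine a solution until the critique accepts it or the cap is hit."""
    summary = generate(question)
    history = [summary]
    for _ in range(max_rounds):
        critique_text, judged_correct = critique(question, summary)
        if judged_correct:
            break  # the model's own critique accepts the current solution
        summary = refine(question, summary, critique_text)
        history.append(summary)
    return summary, history


# Toy stand-ins for model calls, used only to exercise the control flow.
def toy_generate(question: str) -> str:
    return "41"


def toy_critique(question: str, summary: str) -> Tuple[str, bool]:
    ok = summary == "42"
    return ("looks right" if ok else "off by one", ok)


def toy_refine(question: str, summary: str, critique_text: str) -> str:
    return str(int(summary) + 1)


final, history = double_check("What is 6 * 7?", toy_generate, toy_critique, toy_refine)
print(final, history)
```

With `max_rounds=0` the loop never runs and the sketch reduces to direct inference. The early exit is what makes the number of rounds adaptive: easy questions stop after the first accepted critique, while harder ones use up to the cap.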

Paper Structure

This paper contains 33 sections, 2 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Double-Checker correctly solves a math problem in AIME24 leveraging self-critique, while DeepSeek-Qwen-7B still gets the same wrong answer under a self-critical probe.
  • Figure 2: The overview of Double-Checker. (a) Direct inference pipeline of long-CoT LLMs: generating a long thought ($T_0$) followed by a summary ($S_0$) that concludes the answer ($A_0$) for the question ($Q$). (b) The inference pipeline of iterative refinement with self-critique. (c) Training stage of our Double-Checker. (d) Adaptive inference with self-critique of our Double-Checker.
  • Figure 3: Accuracy comparisons on AIME24 for two model sizes (7B and 32B). We compare: (1) DS-Distill-Qwen (a distilled baseline), (2) Naive SFT (fine-tuning without explicit critique), and (3) Double-Checker with varying rounds of self-critique ($N=0,1,2,3$).
  • Figure 4: Token usage for AIME24 (left) and GPQA (right). Blue solid line: the per-round average token count; orange dashed line: the cumulative token count over all rounds; green dash-dotted line: the average token consumption of the "naive SFT" baseline without iterative refinement.
  • Figure 5: The probe used in our experiment in Sec. \ref{sec:aha_moment_vs_critique}.
  • ...and 5 more figures