Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning

Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

TL;DR

This work tackles inefficiencies in test-time scaling for mathematical reasoning by exposing stepwise checkpoints within LLM reasoning. It introduces SRCA, consisting of Checkpoint Injection, Answer-Clustered Search, and Checkpoint Candidate Augmentation, to diversify reasoning paths and fully leverage intermediate results. Empirical results across GSM8K, MATH500, AIME, and OlympiadBench show SRCA outperforms Beam Search and DVTS, enabling smaller models to rival larger ones and achieving higher data efficiency through reduced sampling. The findings highlight the value of intermediate checkpoints for robust, fault-tolerant reasoning and offer a practical direction for future TTS research.

Abstract

Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.
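As a rough illustration of the Answer-Clustered Search idea described above, the following minimal sketch groups candidate steps by their checkpoint answer and then picks the best-scoring step from each cluster in turn. The `Step` type, the function name `answer_clustered_select`, the ranking of clusters by summed score, and the round-robin selection are all assumptions made for this sketch; the abstract does not specify these details, and the scores stand in for whatever process reward model is used.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Step(NamedTuple):
    """A candidate reasoning step (hypothetical structure for this sketch)."""
    path_id: int
    checkpoint_answer: str  # intermediate answer exposed at the checkpoint
    score: float            # placeholder for a reward-model score


def answer_clustered_select(steps: List[Step], m: int) -> List[Step]:
    """Keep m steps while spreading the picks across distinct checkpoint answers."""
    # Group candidate steps by the answer they produce at the checkpoint.
    clusters = defaultdict(list)
    for step in steps:
        clusters[step.checkpoint_answer].append(step)

    # Within each cluster, order steps from highest to lowest score.
    ranked = [sorted(c, key=lambda s: s.score, reverse=True)
              for c in clusters.values()]
    # Assumption: clusters are ranked by their summed score.
    ranked.sort(key=lambda c: sum(s.score for s in c), reverse=True)

    # Round-robin over clusters: take each cluster's best remaining step.
    selected: List[Step] = []
    depth = 0
    while len(selected) < m and any(depth < len(c) for c in ranked):
        for cluster in ranked:
            if depth < len(cluster) and len(selected) < m:
                selected.append(cluster[depth])
        depth += 1
    return selected


# Toy data loosely mirroring the N=6, M=2 setting: three answer clusters.
steps = [
    Step(0, "6", 0.9), Step(1, "6", 0.7), Step(2, "6", 0.6),
    Step(3, "4", 0.8), Step(4, "4", 0.5),
    Step(5, "9", 0.3),
]
picked = answer_clustered_select(steps, m=2)
print([(s.checkpoint_answer, s.score) for s in picked])
```

A plain top-2 selection by score would also keep the steps scored 0.9 and 0.8 here, but with scores such as 0.9, 0.85, 0.8 inside a single cluster it would keep two steps that share one intermediate answer. Selecting across clusters is what counters that kind of path homogenization.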

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of SRCA. Top-right: The checkpoint operation, which serves as the atomic operation in SRCA. Left: Illustration of ACS strategy at step $i$, where $N=6$ and $M=2$. Retrieved reasoning steps are clustered into three groups based on their checkpoint answers (indicated by different shades), with the highest-scoring nodes selected from clusters with answers 6 and 4 for subsequent reasoning. Bottom-right: CCA strategy, where paths 3 and 4 represent high-quality intermediate reasoning steps collected by CCA.
  • Figure 2: Performance trends of TTS methods with DeepSeek PRM (top row) and Skywork PRM (bottom row) as the sampling number $N$ increases from 16 to 128. In the bottom row, we additionally mark the performance of the 70B model with a green line for comparison.
  • Figure 3: Pass@K trends of the 1B model with different TTS methods and DeepSeek PRM as the sampling number increases from 16 to 128. Note that for Pass@K calculation, Self-Consistency, BoN, and Weighted BoN degrade to Independent Sampling.
  • Figure 4: The average accuracy and search depth of SRCA with early stopping strategies under different values of $\tau$. The left y-axis represents the search depth, while the right y-axis represents the accuracy (%). The dashed line in the figure annotates the reduction rate of tree depth, i.e., the number of reasoning steps, when $\tau=0.95$. The pentagon represents the best performance.
  • Figure 5: Ablation study results on four datasets, grouped by different values of $N$. For the bars corresponding to methods incorporating CCA, the Checkpoint Answer Rate (CAR) is additionally marked with slashed shading. The average CAR for each dataset is indicated in the top-left corner of each subplot.
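The CCA strategy mentioned in the Figure 1 caption pools intermediate checkpoint answers with the answers of completed paths before the final decision. The sketch below is one hedged reading of that idea: the `Candidate` type, the function name `pick_final_answer`, and the rule of taking the single highest-scoring candidate are assumptions for illustration only, not the paper's stated procedure.

```python
from typing import Iterable, NamedTuple


class Candidate(NamedTuple):
    """An answer candidate (hypothetical structure for this sketch)."""
    answer: str
    score: float           # placeholder for a reward-model score
    from_checkpoint: bool  # True if taken from an intermediate checkpoint


def pick_final_answer(final_candidates: Iterable[Candidate],
                      checkpoint_candidates: Iterable[Candidate]) -> Candidate:
    """Choose among completed-path answers plus collected checkpoint answers."""
    pool = list(final_candidates) + list(checkpoint_candidates)
    if not pool:
        raise ValueError("no candidates to choose from")
    # Assumption: the highest-scoring candidate wins, whatever its origin.
    return max(pool, key=lambda c: c.score)


# Completed paths that drifted to weaker answers in their later steps.
finals = [Candidate("5", 0.55, False), Candidate("7", 0.60, False)]
# High-quality intermediate answers saved from earlier checkpoints.
checkpoints = [Candidate("6", 0.92, True), Candidate("4", 0.40, True)]

best = pick_final_answer(finals, checkpoints)
print(best.answer, best.from_checkpoint)
```

In this toy case the chosen answer comes from a checkpoint rather than from a completed path, which is the fault-tolerant behaviour the abstract describes: a strong intermediate result can still decide the outcome even if the path that produced it later goes wrong.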