Temporal Consistency for LLM Reasoning Process Error Identification

Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang

TL;DR

The paper tackles reliable verification of multi-step mathematical reasoning by introducing Temporal Consistency, a training-free, test-time framework in which multiple verifiers iteratively re-evaluate their own judgments until they reach a stable consensus on the first incorrect step. It defines a formal task, a three-phase verification procedure, convergence criteria, and a stopping rule based on temporal stability and consensus, enabling vertical (time-based) scaling instead of horizontal ensemble growth. Empirically, Temporal Consistency yields substantial gains across ProcessBench, MathCheck*, and PRM800K, including strong results for distilled DeepSeek-R1 models that surpass many large-scale baselines and even GPT-4o on ProcessBench. The method shows favorable cost-performance behavior and robustness across problem difficulty, and ablation studies confirm the value of both the iterative self-checking and the multi-agent components. The work also provides extensive comparisons to existing verification and reasoning strategies and releases code for replication.

Abstract

Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method in which verifiers iteratively refine their judgments based on their previous assessments. Unlike one-round verification or multi-model debate approaches, our method leverages consistency across a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (MathCheck*, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek-R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1. Our code is available at https://github.com/jcguo123/Temporal-Consistency
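The iterative self-checking loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `temporal_consistency`, the `Verifier` signature, and the parameters `stable_rounds`, `min_share`, and `max_rounds` are all hypothetical, and the stopping rule shown (the majority answer is unchanged for a few rounds and holds more than a given share of the votes) is only one plausible reading of "temporal stability and consensus". The toy verifiers stand in for LLM calls.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical verifier: takes its previous judgment (index of the first
# incorrect step, or -1 for "no error") and returns a re-checked judgment.
Verifier = Callable[[int], int]


def temporal_consistency(
    initial: List[int],
    recheck: List[Verifier],
    stable_rounds: int = 2,    # assumed: rounds the majority must stay unchanged
    min_share: float = 0.5,    # assumed: vote share the majority must exceed
    max_rounds: int = 10,      # assumed: hard cap on self-checking rounds
) -> int:
    judgments = list(initial)
    history: List[int] = []    # majority answer observed in each round
    for _ in range(max_rounds):
        answer, votes = Counter(judgments).most_common(1)[0]
        history.append(answer)
        recent = history[-stable_rounds:]
        stable = len(recent) == stable_rounds and len(set(recent)) == 1
        if stable and votes / len(judgments) > min_share:
            return answer
        # Each verifier re-examines only its own previous judgment.
        judgments = [f(j) for f, j in zip(recheck, judgments)]
    # Fall back to the latest majority if no stable consensus was reached.
    return Counter(judgments).most_common(1)[0][0]


# Toy run in the spirit of the Figure 4 example: the true first error is at
# step 1, two verifiers start out wrong, and self-checking corrects them.
result = temporal_consistency([3, 2, 1], [lambda prev: 1] * 3)
```

In this sketch, additional compute is spent on more rounds per verifier rather than on more verifiers, which is the sense in which the TL;DR contrasts vertical with horizontal scaling.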

Paper Structure

This paper contains 34 sections, 6 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Performance improvements for various models on process error identification benchmarks.
  • Figure 2: Overview of our Temporal Consistency approach, where each LLM iteratively examines its own verification results until reaching a stable result (stopping criteria defined in the algorithm section). The self-checking mechanism allows LLMs to refine their judgments based on previous verifications, potentially correcting initial misidentifications.
  • Figure 3: Cost vs. performance across different methods and models on ProcessBench. The x-axis (logarithmic scale) shows the cost per problem in dollars (based on OpenRouter pricing), while the y-axis shows the F1 score (%).
  • Figure 4: Example of the self-checking process. The first error occurs in step 1. Initially, two LLMs misidentify the first incorrect step, while one locates it correctly. After self-checking, all LLMs reach the correct identification.
  • Figure 5: Performance comparison across three datasets (MathCheck$^*$, ProcessBench, and PRM800K). Our Temporal Consistency approach (green) consistently outperforms baseline methods, including greedy decoding (yellow), majority voting (orange), and multi-model debate (red).
  • ...and 4 more figures