Temporal Consistency for LLM Reasoning Process Error Identification

Jiacheng Guo; Yue Wu; Jiahao Qiu; Kaixuan Huang; Xinzhe Juan; Ling Yang; Mengdi Wang

TL;DR

The paper tackles reliable verification of multi-step mathematical reasoning by introducing Temporal Consistency, a training-free, test-time framework in which multiple verifiers iteratively re-evaluate their own judgments until they reach a stable, consensus identification of the first incorrect step. It defines a formal task, a three-phase verification procedure, convergence criteria, and a stopping rule based on temporal stability and consensus, enabling vertical (time-based) scaling instead of horizontal ensemble growth. Empirically, Temporal Consistency yields substantial gains across ProcessBench, MathCheck*, and PRM800K, including strong results for distilled DeepSeek-R1 models, which surpass many large-scale baselines and even GPT-4o on ProcessBench. The method demonstrates favorable cost-performance behavior and robustness across problem difficulty, and ablation studies confirm the value of both the iterative self-checking and the multi-agent components. The work also provides extensive comparisons to existing verification and reasoning strategies and releases code for replication.

Abstract

Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek-R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1. Our code is available at https://github.com/jcguo123/Temporal-Consistency
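
The iterative self-checking loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `recheck` callable, the `patience` and `max_rounds` parameters, and the exact form of the stopping test (majority answer unchanged and its support non-decreasing over `patience` consecutive rounds) are assumptions made for illustration. A judgment here is the index of the first incorrect step, with -1 meaning no error was found.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical verifier interface (an assumption, not the paper's API):
# (problem, steps, previous_judgment) -> new judgment.
Recheck = Callable[[str, List[str], int], int]


def majority(judgments: List[int]) -> Tuple[int, float]:
    """Return the most common judgment and the fraction of verifiers holding it."""
    value, count = Counter(judgments).most_common(1)[0]
    return value, count / len(judgments)


def temporal_consistency(
    problem: str,
    steps: List[str],
    initial: List[int],
    recheck: Recheck,
    patience: int = 2,
    max_rounds: int = 10,
) -> Tuple[int, int]:
    """Iterate self-checking until the majority judgment is stable.

    Returns (final majority judgment, number of self-check rounds used).
    """
    judgments = list(initial)
    history = [majority(judgments)]
    for round_idx in range(1, max_rounds + 1):
        # Each verifier re-evaluates the solution given only its own
        # previous judgment (no cross-verifier debate).
        judgments = [recheck(problem, steps, j) for j in judgments]
        history.append(majority(judgments))

        recent = history[-(patience + 1):]
        if len(recent) == patience + 1:
            # Assumed stopping test: same majority answer and
            # non-decreasing support across the recent rounds.
            stable = all(value == recent[0][0] for value, _ in recent)
            growing = all(a[1] <= b[1] for a, b in zip(recent, recent[1:]))
            if stable and growing:
                return history[-1][0], round_idx
    return history[-1][0], max_rounds
```

In practice `recheck` would wrap an LLM call that shows the verifier its earlier assessment and asks it to confirm or revise it; a toy stand-in such as `lambda problem, steps, prev: 2` is enough to exercise the loop.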
