Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step
Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
TL;DR
This work introduces Chain-of-Probe (CoP), a probing framework that tracks model confidence after each CoT step, in sync with generation, to assess the necessity and accuracy of Chain-of-Thought. By analyzing confidence trajectories, the authors show that many questions, especially easier ones, do not require CoT, and that erroneous CoTs can accompany correct answers; they propose the CoP Score to select beneficial CoTs and the CoP Tree to filter out likely faulty reasoning. Across the MMLU, BBH, and ARC datasets, CoP-based selection achieves accuracy comparable to or better than majority voting, with statistically significant gains on ARC-Challenge, and the CoP Tree improves CoT accuracy by about 13% on average. The findings highlight that robust reasoning should account for both answer correctness and the integrity of the reasoning process, offering low-cost, scalable strategies to improve reliability without full self-correction cycles like Tree-of-Thought. Limitations include applicability only to single-token final answers and imperfect recall of CoT correctness, pointing future work toward broader output formats and more nuanced error detection.
Abstract
Recent research has identified the issue of Early Answering in large language models (LLMs), where the model already has an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of the CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in the model's mind during its reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by the number of reasoning steps required. Furthermore, by analyzing patterns in mind change, we examine the correctness of the model's reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP that prioritizes answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model's reasoning.
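To make the idea of probing mind changes concrete, here is a minimal, hypothetical sketch (not the authors' code): it assumes we have already probed, for each CoT step, the model's probability distribution over candidate answers, and it then flags early answering (high confidence before any reasoning) and counts mind changes (flips of the top answer between steps).

```python
# Hypothetical sketch of a Chain-of-Probe-style trajectory analysis.
# Assumption: `dists` holds answer distributions probed from the model,
# where dists[0] is probed BEFORE any CoT step and dists[1:] after each step.

def top_answer(dist):
    """Return the candidate answer with the highest probability."""
    return max(dist, key=dist.get)

def analyze_trajectory(dists, threshold=0.9):
    """Classify a confidence trajectory.

    Returns (early_answering, num_mind_changes):
    - early_answering: True if the pre-CoT top answer already exceeds
      `threshold` confidence, suggesting CoT may be unnecessary.
    - num_mind_changes: how often the top answer flips between steps.
    """
    early = dists[0][top_answer(dists[0])] >= threshold
    changes = sum(
        1 for prev, cur in zip(dists, dists[1:])
        if top_answer(prev) != top_answer(cur)
    )
    return early, changes

# Toy trajectory: the model is confident in "A" before reasoning starts
# and never changes its mind, so CoT is likely unnecessary here.
traj = [
    {"A": 0.95, "B": 0.03, "C": 0.01, "D": 0.01},
    {"A": 0.96, "B": 0.02, "C": 0.01, "D": 0.01},
]
print(analyze_trajectory(traj))  # (True, 0)
```

In the paper's setting, these per-step distributions come from probing the model's confidence in single-token final answers; the threshold and scoring here are illustrative placeholders, not the actual CoP Score.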
