Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step

Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

TL;DR

This work introduces Chain-of-Probe (CoP), a probing framework that tracks model confidence after each CoT step, synchronously with output generation, to assess the necessity and accuracy of Chain-of-Thought reasoning. By analyzing confidence trajectories, the authors show that many questions, especially easier ones, do not require CoT and that erroneous CoTs can accompany correct answers; they propose the CoP Score to select beneficial CoTs and the CoP Tree to filter out likely faulty reasoning. Across the MMLU, BBH, and ARC datasets, CoP-based selection achieves accuracy comparable to or better than majority voting, with statistically significant gains on ARC-Challenge, and the CoP Tree improves CoT accuracy by about 13% on average. The findings highlight that robust reasoning should account for both answer correctness and the integrity of the reasoning process, offering low-cost, scalable strategies to improve reliability without full self-correction cycles such as Tree-of-Thought. Limitations include applicability only to single-token final answers and imperfect recall of CoT correctness, pointing future work toward broader output formats and more nuanced error detection.

Abstract

Recent research has identified the issue of Early Answering in large language models (LLMs), where a model already has an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of the CoT? To address these questions, we propose a method, Chain-of-Probe (CoP), to probe changes in the model's mind during reasoning. The probing results show that in a significant number of question-answer cases CoT appears to be unnecessary, and that this necessity correlates with the simplicity of the task, defined by the number of reasoning steps required. Furthermore, by analyzing patterns of mind change, we examine the correctness of the model's reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP that prioritizes answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model's reasoning.
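
For intuition, here is a minimal sketch of how step-wise confidence probing could be implemented with Hugging Face Transformers: after each reasoning step, a probing string elicits the answer, and the next-token probabilities over the option tokens {A, B, C, D} (the target token set described in Figure 2) are recorded as a confidence trajectory. The model name, probing string, and helper function are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of step-wise confidence probing; not the paper's exact implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

PROBE = "\nTherefore, the answer is ("         # assumed probing string appended after each step
OPTIONS = ["A", "B", "C", "D"]                 # target token set for multiple-choice questions
OPTION_IDS = [tokenizer.encode(o, add_special_tokens=False)[0] for o in OPTIONS]

def probe_confidence(question: str, cot_steps: list[str]) -> list[dict]:
    """Return, for each CoT step, the model's probability over the answer options."""
    trajectory, context = [], question
    for step in cot_steps:
        context += "\n" + step                              # reveal one more reasoning step
        inputs = tokenizer(context + PROBE, return_tensors="pt").to(model.device)
        with torch.no_grad():
            next_token_logits = model(**inputs).logits[0, -1]
        probs = torch.softmax(next_token_logits, dim=-1)
        trajectory.append({o: probs[i].item() for o, i in zip(OPTIONS, OPTION_IDS)})
    return trajectory
```

Under these assumptions, a trajectory whose top option flips between steps corresponds to the "mind change" patterns the paper analyzes, while a stable, high-confidence trajectory from the first probe onward indicates early answering.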

Paper Structure

This paper contains 35 sections, 7 equations, 11 figures, and 14 tables.

Figures (11)

  • Figure 1: Diagram of early answering and Chain-of-Probe. Line graphs illustrate several typical patterns of confidence changes.
  • Figure 2: The pipeline of Chain-of-Probe with a running example. Yellow boxes represent each reasoning step in the CoT. Gray boxes denote predefined probing strings. In this case, {A, B, C, D} serves as the target token set, with each probing collecting the model's predicted probabilities for these four tokens (illustrated by yellow bar charts).
  • Figure 3: The left figure shows the early answering ratio in the model on the MMLU and BBH datasets. The right two figures compare the model's accuracy on these datasets, distinguishing between cases with early answering (EA) issues and those without (Not EA).
  • Figure 4: Relationship between the early answering ratio (EAR) and accuracy. The two graphs on the left show the results of four models on the MMLU and BBH datasets. The right part shows the results of the LLaMA-2 7b model on the MMLU dataset, categorized by discipline. Gaussian smoothing with sigma=1 and order=0 was applied to each line to better observe overall trends (see the smoothing sketch after this figure list).
  • Figure 5: The ratio of CoT changing answers from True to False versus from False to True.
  • ...and 6 more figures
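
For reference, the Gaussian smoothing mentioned in the Figure 4 caption (sigma=1, order=0) corresponds to a standard 1-D filter. A minimal sketch using SciPy is shown below; the curve values are made up for illustration.

```python
# Smoothing a trend line as described in the Figure 4 caption (values are hypothetical).
import numpy as np
from scipy.ndimage import gaussian_filter1d

accuracy_by_bin = np.array([0.82, 0.79, 0.75, 0.77, 0.70, 0.68, 0.64])  # hypothetical accuracy per EAR bin
smoothed = gaussian_filter1d(accuracy_by_bin, sigma=1, order=0)          # order=0 gives plain Gaussian smoothing
print(smoothed.round(3))
```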