Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness

Jiachun Li, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao

TL;DR

This work analyzes Chain-of-Thought (CoT) prompting from two angles: effectiveness and faithfulness. It identifies three key drivers of CoT performance (problem difficulty, information gain, and information flow) and shows how these factors lead to different gains across tasks. The authors identify three unfaithful CoT patterns in logical reasoning and show that final answers can still benefit from correct information recalled from the question, even when the CoT is flawed. They propose QUIRE, a method that recalls context information and uses information gain-based weighting to improve both faithfulness and effectiveness, with empirical gains of up to 5.6% in faithfulness and 2.4% in accuracy. Overall, the paper highlights faithfulness as a lever for improving CoT performance and offers a practical approach to mitigating unfaithful reasoning in LLMs.

Abstract

Chain-of-thought (CoT) prompting performs unevenly across reasoning tasks. Previous work attempts to evaluate it but falls short of an in-depth analysis of the patterns that influence CoT. In this paper, we study CoT performance from the perspectives of effectiveness and faithfulness. For the former, we identify key factors that influence how much CoT improves performance, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue through a joint analysis of the information interaction among the question, CoT, and answer. The results show that, when the LLM predicts answers, it can recall from the question correct information that is missing in the CoT, which gives rise to the unfaithfulness. Finally, we propose a novel algorithm to mitigate this issue: we recall extra information from the question to enhance CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.
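
As a rough illustration of what evaluating CoTs by their information gain could look like, the sketch below aggregates several sampled CoTs by weighting each candidate answer with a gain score instead of counting votes equally. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `weighted_vote`, the `(answer, info_gain)` input format, and the clipping of negative gains to zero are all hypothetical, and how the gain itself is computed is not specified in this summary, so it is treated as a given input.

```python
from collections import defaultdict


def weighted_vote(candidates):
    """Pick the answer with the largest total information-gain weight.

    `candidates` is a list of (answer, info_gain) pairs, one per sampled CoT.
    Both the input format and the gain values are assumptions made for
    illustration; the gain is supplied by the caller, not computed here.
    """
    totals = defaultdict(float)
    for answer, gain in candidates:
        # Assumption: a CoT with negative gain contributes no weight.
        totals[answer] += max(gain, 0.0)
    if not totals:
        raise ValueError("no candidates given")
    # Ties resolve to the answer that appeared first in `candidates`.
    return max(totals, key=totals.get)


# Two low-gain CoTs agree on "A"; one high-gain CoT says "B".
samples = [("A", 0.1), ("B", 0.9), ("A", 0.2)]
print(weighted_vote(samples))
```

With these illustrative numbers, a plain majority vote would pick "A", while the gain-weighted vote favors the single CoT that carries more information.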

Paper Structure (46 sections, 7 equations, 21 figures, 4 tables)

This paper contains 46 sections, 7 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: CoT improvement across different models and datasets; 'score' indicates the accuracy difference.
  • Figure 2: Performance on different problem difficulty levels with and without CoT prompting (Llama3.1-8B).
  • Figure 3: Difficulty distribution in different datasets.
  • Figure 4: CoT information gain in different datasets.
  • Figure 5: Information flow between the CoT and answer. 'Step' indicates sequential positions within the CoT, where 0 is the beginning and 100 is the end.
  • ...and 16 more figures