The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Y. Wong, Simon See

TL;DR

Chain-of-Thought prompting, while effective in many tasks, systematically underperforms direct answering in pattern-based in-context learning across 16 LLMs and 9 benchmarks. The authors hypothesize that a contextual distance curse, plus difficulties in inferring latent patterns, undermine CoT explanations; they reveal an explicit-implicit hybrid mechanism where implicit reasoning often drives correct answers despite flawed rationales, but CoT structure diminishes this advantage. They show that longer CoT reasoning yields at best comparable performance despite $40\times$ more inference tokens, and even specialized reasoning models fail to overcome these limitations. The work argues for adaptive, context-aware reasoning strategies that balance explicit and implicit processes and highlights implications for designing robust reasoning in LLMs.

Abstract

Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning, disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
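
The contrast between direct answering and CoT in pattern-based ICL, and the notion of contextual distance, can be illustrated with a small sketch. This is not the authors' code or their prompts: the function names, the prompt templates, and the whitespace-based token count are all assumptions made for illustration only.

```python
# Hypothetical sketch: direct vs. CoT prompts for a pattern-based ICL task.
# Templates and helper names are illustrative assumptions, not the paper's setup.

def format_demos(demos):
    """Render (input, output) demonstration pairs as plain text."""
    return "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)

def build_direct_prompt(demos, query):
    """Direct answering: the answer slot immediately follows the query."""
    return f"{format_demos(demos)}\n\nInput: {query}\nOutput:"

def build_cot_prompt(demos, query):
    """CoT: the model is asked to write a rationale before the answer."""
    return (
        f"{format_demos(demos)}\n\nInput: {query}\n"
        "Reasoning: let's think step by step, then give the final Output."
    )

def contextual_distance(rationale):
    """Rough proxy (assumption): number of whitespace-separated tokens that a
    rationale places between the demonstrations/query and the final answer."""
    return len(rationale.split())

demos = [("[1, 2, 3]", "[3, 2, 1]"), ("[4, 5]", "[5, 4]")]
direct_prompt = build_direct_prompt(demos, "[7, 8, 9]")
cot_prompt = build_cot_prompt(demos, "[7, 8, 9]")

# With direct answering no rationale is generated, so the proxy distance is 0.
direct_distance = contextual_distance("")
# A (made-up) rationale pushes the answer further from the demonstrations.
cot_distance = contextual_distance("The list appears to be reversed in each example.")
```

Under this proxy, any generated rationale increases the separation between the demonstrations and the answer, which is the quantity the abstract refers to as contextual distance; how the paper measures it may differ.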

Paper Structure

This paper contains 35 sections, 3 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: (a) Performance of direct answering, CoT, ReAct, and ToT across 9 ICL benchmarks, averaged over 16 LLMs. (b) Performance gaps between direct answering and CoT with varying numbers of demonstrations.
  • Figure 2: Detailed benchmark performance of LLMs with direct answering, CoT, ReAct, and ToT. Gemma2 models were excluded from ARC-AGI experiments due to limited context length.
  • Figure 3: (a) Average performance with dummy rationale in Shakespeare's Sonnet. (b) Average performance with dummy rationale in countdown list. (c) Effect of rationale frontloading. All scores represent mean accuracies across six LLMs.
  • Figure 4: Performance comparison of pattern inference and execution across two benchmarks (List Function and MiniSCAN) and six LLMs.
  • Figure 5: Decomposition of CoT success: contributions from explicit and implicit reasoning.
  • ...and 1 more figure