Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking

Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Lizhe Chen, Baolong Bi, Xueqi Cheng

TL;DR

This work investigates whether external Chain-of-Thought (CoT) prompts improve reasoning in Reasoning Large Language Models (RLLMs), which already have innate CoT capability, across six mathematical benchmarks and model scales from 1.5B to 32B parameters. Zero-shot and Few-shot CoT prompts generally improve accuracy: large models benefit most on complex tasks and smaller models on simpler ones, and one-shot CoT often yields the best overall performance. CoT prompting also shapes the distributions of thinking tokens and reasoning steps, reducing excessive reflection by approximately 90% in some cases, while attention analyses show that RLLMs overfit to reflection cues, an effect that external CoT guidance helps mitigate. The findings provide practical prompting guidelines for RLLMs and indicate that external CoT remains valuable for improving mathematical problem solving and controlling overthinking.

Abstract

Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?" In this work, we present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that contrary to concerns, CoT prompting significantly enhances RLLMs' performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distribution of the numbers of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention logits analysis reveals the RLLMs' overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT approaches. Our findings provide important insights for optimizing RLLMs' performance through appropriate prompting strategies.
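
To make the three prompting conditions named in the abstract concrete, here is a minimal sketch of how direct, Zero-shot CoT, and Few-shot CoT prompts are commonly assembled. The "Question:/Answer:" template, the function names, and the trigger phrase are illustrative assumptions; the abstract does not give the exact wording used in the experiments.

```python
from typing import List, Tuple

# Widely used Zero-shot CoT cue. Assumed here for illustration;
# the paper's exact prompt text is not given in this summary.
ZERO_SHOT_TRIGGER = "Let's think step by step."


def direct_prompt(question: str) -> str:
    """Baseline: the question alone, with no external CoT guidance."""
    return f"Question: {question}\nAnswer:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append a reasoning cue, no demonstrations."""
    return f"Question: {question}\nAnswer: {ZERO_SHOT_TRIGGER}"


def few_shot_cot_prompt(question: str, demos: List[Tuple[str, str]]) -> str:
    """Few-shot CoT: prepend (question, worked rationale) demonstrations.

    Passing a single demonstration gives the one-shot setting.
    """
    blocks = [f"Question: {q}\nAnswer: {rationale}" for q, rationale in demos]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical demonstration, written only for this example.
    demo = ("What is 2 + 3?", "2 plus 3 equals 5. The answer is 5.")
    print(few_shot_cot_prompt("What is 7 * 6?", [demo]))
```

Under this sketch, the one-shot condition is simply the Few-shot builder called with a list of length one, which keeps the comparison between one-shot and multi-shot prompts down to the number of demonstrations.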

Paper Structure

This paper contains 29 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: CoT prompting continues to play an important role in reasoning LLMs: (1) improving reasoning performance, (2) controlling the number of thinking tokens, (3) regulating the number of reasoning steps, and (4) mitigating overthinking.
  • Figure 2: Distributions of thinking tokens across various RLLMs under three prompting methods evaluated on the MATH benchmark. The horizontal axis indicates the number of thinking tokens in the thinking parts (#Token), and the vertical axis represents the corresponding ratio. Histograms labeled "Correct" and "Incorrect" depict the distribution of token counts for correctly and incorrectly solved problems, while the trend lines ("Correct Trend" and "Incorrect Trend") represent smoothed regression fits of these distributions.
  • Figure 3: Relationship between accuracy and the average number of reasoning steps for different RLLMs evaluated on AIME24 and AMC23. The horizontal axis represents the average number of reasoning steps (#Steps), and the vertical axis represents accuracy. Dotted lines indicate regression fits illustrating the general correlation trends between average number of reasoning steps and accuracy.
  • Figure 4: Relationship between the number of reasoning steps (#Step) and accuracy of RLLMs on the GSM8K and ASDiv datasets. The accuracy is averaged across individual reasoning steps provided by the RLLMs. Results show that accuracy initially increases with the number of steps but declines after reaching an optimal point (around 2-3 steps).
  • Figure 5: Visualization of Attention Distribution Mechanisms in LLaMA3.1-8B-Instruct and R1-8B. The heatmaps (left side) show attention logits (before softmax), averaged over all heads per layer, and the corresponding bar graphs (right side) illustrate the softmax-normalized attention scores for the input sequence "Wait, let me double-check to ensure I haven't misread the problem." Subfigures (a)-(d) represent the LLaMA3.1-8B-Instruct at layers 9 and 26, while subfigures (e)-(h) depict the R1-8B at the same layers. Here, the attention scores, denoted by $\alpha$, are computed as $\alpha = \mathbb{E}_{h}\left[\sigma(\mathbf{A})\right]$.
  • ...and 2 more figures
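
The two quantities in the Figure 5 caption can be sketched as follows: pre-softmax attention logits averaged over heads (the heatmaps) and $\alpha = \mathbb{E}_{h}\left[\sigma(\mathbf{A})\right]$ (the bar graphs). This is a minimal NumPy sketch that assumes $\sigma$ is a softmax over key positions and $\mathbb{E}_{h}$ is a plain mean over heads; the array layout `(heads, query_len, key_len)` and the function names are assumptions, and causal masking is omitted for brevity.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def head_averaged_attention(logits: np.ndarray):
    """Summarize one layer's attention across heads.

    logits: pre-softmax attention logits A, assumed to have shape
            (heads, query_len, key_len).
    Returns:
        mean_logits: logits averaged over heads, shape (query_len, key_len).
        alpha:       E_h[softmax(A)], i.e. softmax applied per head over key
                     positions, then averaged over heads.
    """
    mean_logits = logits.mean(axis=0)
    alpha = softmax(logits, axis=-1).mean(axis=0)
    return mean_logits, alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=(4, 6, 6))  # synthetic values, not model output
    mean_logits, alpha = head_averaged_attention(fake_logits)
    print(mean_logits.shape, alpha.shape)
```

Note that averaging after the softmax, as written in the caption's formula, is not the same as applying a softmax to the head-averaged logits; each row of `alpha` still sums to 1 because it is a mean of per-head probability distributions.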