Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu, Xinyu Zhang, Xiao Zhang, Yong Liu

TL;DR

The paper investigates whether CoT exemplars improve mathematical reasoning for recent strong LLMs and finds that Zero-shot CoT generally suffices, with the main benefit of exemplars being output-format alignment rather than enhanced reasoning. It identifies and corrects evaluation biases in GSM8K, showing that Zero-shot CoT often outperforms Few-shot CoT when format is properly handled. Enhanced exemplars and retrieval-based strategies yield only marginal gains, largely because strong models rely on intrinsic reasoning abilities and prompt structure. The work suggests a need to re-examine the ICL paradigm and exemplar design for future prompting strategies in mathematical reasoning tasks.

Abstract

In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) into ICL exemplars to enhance reasoning capability, especially in mathematical tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

Paper Structure

This paper contains 23 sections, 23 figures, 4 tables.

Figures (23)

  • Figure 1: Accuracy under different prompting settings on GSM8K (top) and MATH (bottom). We observe that the Zero-shot setting consistently achieves strong performance, suggesting that the model may not attend to the CoT exemplars. See Section \ref{sec:exp:exemplars-not-essential} for the full experimental results.
  • Figure 2: An overview of ICL and CoT prompting. The figure illustrates the Few-shot CoT setting, where the model performs reasoning based on provided demonstrations and a test question. When no demonstrations are given, the setting corresponds to Zero-shot CoT.
  • Figure 3: Accuracy of different models on the GSM8K dataset under varying numbers of exemplars. Few-shot examples are taken from wei-COT. Only Zero-shot-fixed applies evaluation bias correction, as described in Section \ref{sec:exp:align-format}; all other settings retain uncorrected results for comparison.
  • Figure 4: Accuracy of different models on the GSM8K dataset under various ablation settings. Replace_Q denotes replacing the question in each exemplar with "xxx". Replace_QA replaces both the question and answer with "xxx" but retains the final phrase "So the answer is ...". Replace_ALL replaces the question, answer, and the final phrase with "xxx". See Figures \ref{demo:replace_q}, \ref{demo:replace_qa}, and \ref{demo:replace_all} for input examples, respectively. Other settings follow those in Figure \ref{fig:gsm-large-align-format}.
  • Figure 5: Accuracy of different models under various retrieval methods with a fixed number of 8 retrieved exemplars. The top figure shows results on the MATH dataset, and the bottom figure shows results on the GSM8K dataset.
  • ...and 18 more figures
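
The prompting settings named in the captions of Figures 2 and 4 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code. The `Exemplar` fields, the `ablate` and `build_prompt` helpers, the `Q:`/`A:` layout, and the default instruction string are all assumptions made for this example; only the setting names (Replace_Q, Replace_QA, Replace_ALL), the "xxx" placeholder, and the closing phrase "So the answer is ..." come from the captions.

```python
from dataclasses import dataclass
from typing import Sequence

MASK = "xxx"  # placeholder string quoted in the Figure 4 caption


@dataclass(frozen=True)
class Exemplar:
    """One demonstration. The three-field split is an assumption of this sketch."""
    question: str
    rationale: str  # chain-of-thought steps (the "answer" in the caption)
    final: str      # closing phrase, e.g. "So the answer is 5."


def ablate(ex: Exemplar, setting: str) -> Exemplar:
    """Apply one of the ablation settings named in the Figure 4 caption."""
    if setting == "none":
        return ex
    if setting == "Replace_Q":    # mask the question only
        return Exemplar(MASK, ex.rationale, ex.final)
    if setting == "Replace_QA":   # mask question and answer, keep final phrase
        return Exemplar(MASK, MASK, ex.final)
    if setting == "Replace_ALL":  # mask question, answer, and final phrase
        return Exemplar(MASK, MASK, MASK)
    raise ValueError(f"unknown setting: {setting}")


def build_prompt(
    test_question: str,
    exemplars: Sequence[Exemplar] = (),
    setting: str = "none",
    instruction: str = "Let's think step by step.",  # hypothetical default
) -> str:
    """Few-shot CoT when exemplars are given, Zero-shot CoT when the list is empty.

    The Q:/A: layout is assumed here; the paper's exact template may differ.
    """
    blocks = []
    for ex in exemplars:
        ex = ablate(ex, setting)
        blocks.append(f"Q: {ex.question}\nA: {ex.rationale} {ex.final}")
    blocks.append(f"Q: {test_question}\nA: {instruction}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = Exemplar(
        question="Tom has 2 apples and buys 3 more. How many apples does he have?",
        rationale="Tom starts with 2 apples. 2 + 3 = 5.",
        final="So the answer is 5.",
    )
    print(build_prompt("What is 4 + 7?", [demo], setting="Replace_QA"))
```

Calling `build_prompt` with no exemplars produces the Zero-shot CoT prompt, while passing exemplars with `setting="none"` produces the Few-shot CoT prompt. The three ablation settings change only the exemplar text, so any accuracy difference between them isolates how much the model depends on each part of a demonstration.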