Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Xiang Cheng; Chengyan Pan; Minjun Zhao; Deyang Li; Fangchao Liu; Xinyu Zhang; Xiao Zhang; Yong Liu

TL;DR

The paper investigates whether CoT exemplars improve mathematical reasoning for recent strong LLMs and finds that Zero-shot CoT generally suffices: the main benefit of exemplars is output-format alignment rather than enhanced reasoning. It identifies and corrects evaluation biases in GSM8K, showing that Zero-shot CoT often outperforms Few-shot CoT once output format is properly handled. Enhanced exemplars and retrieval-based strategies yield only marginal gains, largely because strong models rely on intrinsic reasoning abilities and prompt structure. The work suggests a need to re-examine the ICL paradigm and exemplar design for future prompting strategies in mathematical reasoning tasks.

Abstract

In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies add Chain-of-Thought (CoT) reasoning to ICL exemplars to enhance reasoning capability, especially in mathematical tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.
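
To make the distinction between the two prompting styles concrete, and to show how an answer extractor tied to one output format could understate Zero-shot CoT accuracy, the following is a minimal sketch. Everything in it is an illustrative assumption rather than the paper's setup: the prompt templates, the single exemplar, the `####` answer marker, and the function names `zero_shot_cot`, `few_shot_cot`, `extract_strict`, and `extract_lenient` are all hypothetical.

```python
import re

# Hypothetical exemplar written for illustration; not taken from the paper.
EXEMPLARS = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5. #### 5",
    ),
]

# Matches integers and decimals, allowing thousands separators.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def zero_shot_cot(question: str) -> str:
    """Build a prompt with an instruction only and no exemplars."""
    return f"Question: {question}\nAnswer: Let's think step by step."


def few_shot_cot(question: str, exemplars=EXEMPLARS) -> str:
    """Build a prompt that prepends worked exemplars to the question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in exemplars)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"


def _normalize(number: str) -> str:
    """Drop thousands separators so '1,200' and '1200' compare equal."""
    return number.replace(",", "")


def extract_strict(output: str):
    """Accept an answer only if it follows the assumed '####' marker."""
    match = re.search(r"####\s*(" + _NUMBER + ")", output)
    return _normalize(match.group(1)) if match else None


def extract_lenient(output: str):
    """Fall back to the last number that appears anywhere in the output."""
    numbers = re.findall(_NUMBER, output)
    return _normalize(numbers[-1]) if numbers else None


if __name__ == "__main__":
    # A correct answer that does not follow the exemplar format.
    free_form = "She sells 9 eggs at 2 dollars each, so the answer is 18."
    print(extract_strict(free_form))   # no '####' marker, so nothing is found
    print(extract_lenient(free_form))  # the final number is recovered
```

Under these assumptions, a correct free-form answer is scored as wrong by the strict extractor and as right by the lenient one, which is the kind of format-dependent scoring gap that can make exemplars look more helpful for reasoning than they are.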
