When do you need Chain-of-Thought Prompting for ChatGPT?

Jiuhai Chen, Lichang Chen, Heng Huang, Tianyi Zhou

TL;DR

This work investigates whether Chain-of-Thought prompting remains effective for ChatGPT, a model trained with instruction finetuning and RLHF. It conducts a systematic, zero-shot evaluation across three prompting strategies on diverse reasoning benchmarks, comparing ChatGPT with GPT-3. The results show ChatGPT often generates CoT steps spontaneously for arithmetic tasks and may not benefit from explicit CoT prompts, while non-arithmetic tasks can still benefit, indicating task-dependent behavior and possible instruction memorization. The study highlights risks of pretraining-data leakage and demonstrates a potential dataset-inference attack pathway, emphasizing the need for careful prompt design and further study of instruction-following generalization in API LLMs.

Abstract

Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs). For example, simply adding the CoT instruction "Let's think step-by-step" to each input query of the MultiArith dataset improves GPT-3's accuracy from 17.7% to 78.7%. However, it is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while it remains effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and can generate CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT and has thus memorized the instruction, so that it implicitly follows the instruction when applied to the same queries, even without CoT. Our analysis reflects a potential risk of overfitting/bias toward instructions introduced in IFT, which is becoming more common in training LLMs. In addition, it indicates possible leakage of the pretraining recipe, e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results of ChatGPT on a variety of reasoning tasks and shed novel insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
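
To make the two zero-shot settings in the abstract concrete, here is a minimal sketch of how a query might be built with and without the CoT instruction, together with the accuracy gain implied by the MultiArith figures quoted above. The function names, the placement of the instruction after the question, and the sample question are illustrative assumptions, not the authors' code or data.

```python
# Minimal sketch (assumptions: helper names, instruction placement, sample question).

COT_INSTRUCTION = "Let's think step-by-step"  # instruction quoted in the abstract


def build_prompt(question: str, use_cot: bool) -> str:
    """Return the bare question, or the question followed by the CoT instruction.

    Appending the instruction after the question is an assumption made
    for illustration; the abstract only says it is added to each query.
    """
    question = question.strip()
    if not use_cot:
        return question  # "without instruction": the input is only the question
    return f"{question}\n{COT_INSTRUCTION}"


def accuracy_gain(before_pct: float, after_pct: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after_pct - before_pct, 1)


if __name__ == "__main__":
    # Hypothetical arithmetic question, not taken from MultiArith.
    q = "A shelf holds 4 rows of 6 books. How many books are on the shelf?"
    print(build_prompt(q, use_cot=False))
    print(build_prompt(q, use_cot=True))
    # GPT-3 on MultiArith, as reported in the abstract: 17.7% -> 78.7%.
    print(accuracy_gain(17.7, 78.7), "percentage points")
```

Sending both variants of the same question to a model and comparing the responses is one way to observe the behavior the abstract describes, where ChatGPT may produce reasoning steps even for the bare question.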

Paper Structure (17 sections, 1 equation, 6 figures, 3 tables)

This paper contains 17 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An example of arithmetic reasoning by different LLMs when prompting without instruction, i.e., the input only contains the original question. While Text-Davinci-002 and Text-Davinci-003 generate wrong answers, ChatGPT spontaneously generates a sequence of CoT reasoning steps leading to a correct answer.
  • Figure 2: Comparison of the three prompting strategies in Section \ref{sec:prompts} when applied to ChatGPT for six arithmetic reasoning tasks.
  • Figure 3: Comparison of the three prompting strategies in Section \ref{sec:prompts} when applied to ChatGPT for commonsense, symbolic, and two other reasoning tasks.
  • Figure 4: Zero-shot reasoning without instruction (the first query) followed by prompting with trigger words.
  • Figure 5: Zero-shot reasoning with CoT instruction (the first query) followed by prompting with trigger words.
  • ...and 1 more figure