How Many Demonstrations Do You Need for In-context Learning?

Jiuhai Chen, Lichang Chen, Chen Zhu, Tianyi Zhou

TL;DR

This work interrogates how many demonstrations are truly needed for effective in-context learning (ICL) and how chain-of-thought (CoT) prompts interact with demo quantity. Through large-scale empirical analysis using code-davinci-002 on diversified reasoning benchmarks, it shows that a single positive demo often outperforms multi-demo prompts, revealing dataset biases toward easy queries and a tendency for cross-demo interference to misguide LLMs. The findings underscore inefficiencies in current multi-demo ICL, motivate rethinking benchmark design and demo selection, and suggest avenues to train models to better distinguish and fuse demonstrations without harmful interference. The study has practical implications for cost-efficient prompting and for advancing ICL research toward more reliable and scalable reasoning with LLMs.

Abstract

Large language models (LLMs) are capable of performing complex reasoning by in-context learning (ICL) when provided with a few input-output demonstrations (demos), and they are more powerful when the intermediate reasoning steps ("chain of thought", CoT) of the demos are given. Is it necessary to use multiple demos in ICL? In this paper, we study ICL using fewer demos for each test query on the tasks in Wei et al. (2022). Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query, we categorize demos into "correct demos", which lead to the correct answer, and "wrong demos", which result in wrong answers. Our analysis reveals an inherent bias in those widely studied datasets: most demos are correct for a majority of test queries, which explains the good performance of using one random demo. Moreover, ICL (with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted by most previous works, indicating the weakness of LLMs in finding the correct demo(s) for input queries, a weakness that is difficult to evaluate on the biased datasets. Furthermore, we observe a counterintuitive behavior of multi-demo ICL: its accuracy degrades when given more correct demos and improves when given more wrong demos. This implies that ICL can be easily misguided by interference among demos and their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.
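
The correct/wrong demo categorization described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the prompt template, the exact-match check, and the names `build_prompt`, `label_demos`, `positive_count_histogram`, and `toy_model` are all hypothetical, and `model` stands for any callable that maps a prompt string to an answer string.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Demo = Tuple[str, str]          # (demo question, demo answer or CoT rationale)
Model = Callable[[str], str]    # hypothetical interface: prompt -> answer


def build_prompt(demo: Demo, query: str) -> str:
    """Assumed one-demo prompt template: a single demo followed by the query."""
    question, answer = demo
    return f"Q: {question}\nA: {answer}\n\nQ: {query}\nA:"


def label_demos(query: str, gold: str, demos: List[Demo], model: Model) -> List[bool]:
    """Run one-demo ICL once per demo; True marks a correct (positive) demo.

    Exact string match against the gold answer is an assumption made here
    for simplicity.
    """
    return [model(build_prompt(d, query)).strip() == gold for d in demos]


def positive_count_histogram(
    queries: List[Tuple[str, str]], demos: List[Demo], model: Model
) -> Dict[int, int]:
    """Map 'number of positive demos' -> 'number of queries with that count'."""
    counts = Counter(sum(label_demos(q, gold, demos, model)) for q, gold in queries)
    return dict(counts)


# Toy stand-in for an LLM, for illustration only: it answers "4" whenever the
# prompt contains the word "helpful" and "5" otherwise.
def toy_model(prompt: str) -> str:
    return "4" if "helpful" in prompt else "5"


demos = [("helpful: 1+1?", "2"), ("2+3?", "5"), ("helpful: 3+3?", "6")]
labels = label_demos("2+2?", "4", demos, toy_model)
histogram = positive_count_histogram([("2+2?", "4"), ("1+4?", "5")], demos, toy_model)
```

With a real model in place of `toy_model`, a histogram of this kind is what would show whether most demos are correct for most queries, which is the dataset bias the abstract describes.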

Paper Structure (20 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: ICL without CoT: Prompting with one random demo has a slightly lower accuracy than few-shot prompting (8 or 7 demos). Prompting with one positive demo significantly outperforms few-shot prompting.
  • Figure 2: ICL with CoT: Prompting with one random demo has a slightly lower accuracy than CoT prompting (8 or 7 demos). Prompting with one positive demo significantly outperforms CoT prompting.
  • Figure 3: Negative/Positive Demo. In one demo ICL for a test query, a negative demo leads to an incorrect answer while a positive demo results in the correct answer.
  • Figure 4: Easy/Hard Samples from GSM8K: for the hard query (Mark plants a beanstalk ...), all 8 demos are negative and result in wrong answers in one-demo ICL; for the easy query (Alisa biked 12 miles ...), all 8 demos are positive and lead to the correct answer. The 8 demos for arithmetic problems are from Wei et al. (2022).
  • Figure 5: Pie chart of the number of positive demos (ICL with CoT) per sample/query ($0 \sim 6$ inside the pie chart) for queries in (a) the whole GSM8K dataset; (b) GSM-Hard; (c) GSM-Easy.
  • ...and 7 more figures