Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Quanyu Long, Yin Wu, Wenya Wang, Sinno Jialin Pan

TL;DR

This study interrogates what drives in-context learning (ICL) effectiveness by decomposing improvements into label space, label format, and discrimination across multiple tasks and four general-purpose language models. It finds that improvements mainly arise from regulation of label space and format, while discrimination gains are limited and unstable. Retrieval of semantically similar demonstrations significantly boosts discrimination, though diversity in demonstrations remains crucial to avoid weakening space/format effects. The results imply ICL largely acts as an implicit instruction mechanism, guiding outputs through label-space and formatting cues rather than consistently enhancing latent discriminative knowledge, with practical implications for demonstration selection and retrieval strategies.

Abstract

In-context Learning (ICL) has emerged as a powerful capability alongside the development of scaled-up large language models (LLMs). By instructing LLMs using few-shot demonstrative examples, ICL enables them to perform a wide range of tasks without updating millions of parameters. However, the precise contributions of demonstrations towards improving end-task performance have not been thoroughly investigated in recent analytical studies. In this paper, we empirically decompose the overall performance of ICL into three dimensions: label space, format, and discrimination. We evaluate four general-purpose LLMs across a diverse range of tasks. Counter-intuitively, we find that the demonstrations have a marginal impact on provoking discriminative knowledge of language models. However, ICL exhibits significant efficacy in regulating the label space and format, which helps LLMs respond with the desired label words. We then demonstrate that this ability functions similarly to detailed instructions for LLMs to follow. We additionally provide an in-depth analysis of the mechanism by which retrieval helps ICL. Our findings demonstrate that retrieving semantically similar examples notably boosts the model's discriminative capability. However, we also observe a trade-off in selecting good in-context examples regarding label diversity.
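
The retrieval trade-off mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy bag-of-words similarity stands in for whatever sentence encoder a real system would use, and the names `retrieve` and `label_coverage` are hypothetical. The sketch retrieves the demonstrations most similar to a query and then checks how much of the label space they cover.

```python
import math
from collections import Counter

def bow(text):
    """Toy bag-of-words vector (assumption: a real system would use a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, pool, k):
    """Return the k (text, label) pairs from the pool most similar to the query."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]

def label_coverage(demos, label_space):
    """Fraction of the label space that appears among the demonstrations."""
    seen = {label for _, label in demos}
    return len(seen & set(label_space)) / len(label_space)

# Hypothetical demonstration pool for a sentiment task.
pool = [
    ("the movie was great fun", "positive"),
    ("the movie was a great film", "positive"),
    ("the plot was dull and slow", "negative"),
    ("terrible acting and a dull script", "negative"),
]
demos = retrieve("the movie was great", pool, k=2)
coverage = label_coverage(demos, ["positive", "negative"])
print([label for _, label in demos], coverage)
```

In this toy pool the two nearest neighbours share one label, so the retrieved demonstrations cover only half of the label space. This is the kind of tension between similarity and label diversity that the abstract describes.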

Paper Structure

This paper contains 37 sections, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Inference instances can be categorized into three different sets: out-of-space (OOS), in-space-out-of-format (ISOOF), and in-space-in-format (ISIF). When performing ICL, a large proportion (almost all in our experiments) of OOS and ISOOF instances shift to ISIF.
  • Figure 2: Classification results of decomposed ICL contribution: discrimination (red), label space (blue), and label format (green) when using Random demonstrations. Scores below zero indicate that the factor has a negative effect on performance. We find that discrimination power is the most unstable factor in ICL improvement.
  • Figure 3: Right-to-Wrong (R2W) and Wrong-to-Right (W2R) percentages within the ISIF set. Surprisingly, R2W accounts for a large percentage after performing ICL.
  • Figure 4: Impact of DI (detailed instruction), ICL, and their combination DI+ICL. Results are averaged scores across all classification tasks. Breakdown scores are provided in the appendix. We observe that DI and ICL demonstrate similar performance, and that the benefit of ICL is nearly diminished when comparing the results of DI+ICL with ICL.
  • Figure 5: Impact of incorrect labels within the demonstrations compared to ground truth labels. Results are averaged scores across all classification tasks. (a) shows the overall ICL improvement over the zero-shot setting; (b) shows the decomposed discrimination score; (c) shows the percentage of new ISIF instances coming from OOS and ISOOF, which can be viewed as the combination of label space and format. Panels (a) and (b) show a decrease in ICL performance and discrimination power when demonstrations contain incorrect labels, while panel (c) shows that the label space and format power remain unaffected.
  • ...and 3 more figures
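
The OOS / ISOOF / ISIF split from Figure 1 and the R2W / W2R percentages from Figure 3 can be sketched roughly as follows. The matching rule is an assumption made for illustration: an output counts as ISIF only if it equals a label word exactly, as ISOOF if a label word appears after lowercasing and stripping punctuation, and as OOS otherwise. The sketch also assumes that the flip rates are computed over instances that are ISIF both with and without demonstrations. The paper's exact criteria may differ, and the names `categorize` and `flip_rates` are hypothetical.

```python
import string

def categorize(prediction, label_words):
    """Assign a raw model output to ISIF, ISOOF, or OOS (assumed matching rule)."""
    if prediction in label_words:
        return "ISIF"  # exact label word, in the expected format
    lowered = {w.lower() for w in label_words}
    tokens = [t.strip(string.punctuation) for t in prediction.lower().split()]
    if any(t in lowered for t in tokens):
        return "ISOOF"  # a label word is present, but not in the exact format
    return "OOS"  # no label word found

def flip_rates(zero_shot, icl, gold, label_words):
    """Return (R2W %, W2R %) over instances that are ISIF in both settings."""
    both = [
        (z, i, g)
        for z, i, g in zip(zero_shot, icl, gold)
        if categorize(z, label_words) == "ISIF"
        and categorize(i, label_words) == "ISIF"
    ]
    if not both:
        return 0.0, 0.0
    r2w = sum(z == g and i != g for z, i, g in both)  # right before, wrong after
    w2r = sum(z != g and i == g for z, i, g in both)  # wrong before, right after
    return 100 * r2w / len(both), 100 * w2r / len(both)

# Hypothetical outputs for five inputs of a sentiment task.
label_words = ["positive", "negative"]
zero_shot = ["positive", "negative", "Positive.", "I think it is good", "negative"]
icl = ["negative", "negative", "positive", "positive", "positive"]
gold = ["positive", "negative", "positive", "positive", "positive"]

categories = [categorize(p, label_words) for p in zero_shot]
r2w_pct, w2r_pct = flip_rates(zero_shot, icl, gold, label_words)
print(categories, r2w_pct, w2r_pct)
```

In this toy data, the third and fourth zero-shot outputs start as ISOOF and OOS and become ISIF under ICL, which mirrors the shift shown in Figure 1. Under the assumption above, they are excluded from the flip rates, which are computed only over the three instances that were already ISIF.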