
Are Emergent Abilities in Large Language Models just In-Context Learning?

Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych

TL;DR

This paper scrutinizes so-called emergent abilities in large language models and argues that, once in-context learning (ICL) is removed, these abilities largely disappear. Through over 1000 experiments across 20 models, 22 tasks, and multiple prompting settings, the authors show that instruction-tuned performance largely stems from an implicit ICL mechanism rather than intrinsic functional linguistic capabilities. They demonstrate a substantial overlap between tasks solvable by non-instruction-tuned models in few-shot settings and instruction-tuned models in zero-shot settings, supporting the view that instruction-tuning enables ICL-like behavior. The work provides a theoretical framework for interpreting LLM capabilities, emphasizes safer and more efficient usage, and challenges the notion that scaling alone yields genuinely new abilities beyond prompting dynamics and memory.

Abstract

Large language models, comprising billions of parameters and pre-trained on extensive web-scale corpora, have been claimed to acquire certain capabilities without having been specifically trained on them. These capabilities, referred to as "emergent abilities," have been a driving force in discussions regarding the potentials and risks of language models. A key challenge in evaluating emergent abilities is that they are confounded by model competencies that arise through alternative prompting techniques, including in-context learning, which is the ability of models to complete a task based on a few examples. We present a novel theory that explains emergent abilities, taking into account their potential confounding factors, and rigorously substantiate this theory through over 1000 experiments. Our findings suggest that purported emergent abilities are not truly emergent, but result from a combination of in-context learning, model memory, and linguistic knowledge. Our work is a foundational step in explaining language model performance, providing a template for their efficient use and clarifying the paradox of their ability to excel in some instances while faltering in others. Thus, we demonstrate that their capabilities should not be overestimated.
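To make the zero-shot versus few-shot (in-context learning) distinction concrete, the following is a minimal sketch of how the two kinds of prompt differ. The `build_prompt` helper, the `Input:`/`Output:` layout, and the arithmetic demonstrations are illustrative assumptions, not the prompt format used in the paper.

```python
from typing import Sequence, Tuple


def build_prompt(query: str, demonstrations: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a completion-style prompt (hypothetical format).

    With no demonstrations the prompt is zero-shot; with a few
    (input, output) pairs prepended it becomes a few-shot ICL prompt.
    """
    # Each demonstration is a solved example the model can pattern-match on.
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demonstrations]
    # The query is left unanswered so the model completes it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Made-up demonstrations, purely for illustration.
demos = [("2 + 3", "5"), ("7 + 1", "8")]
zero_shot = build_prompt("4 + 4")
few_shot = build_prompt("4 + 4", demos)
```

In this sketch the only difference between the two settings is whether solved examples precede the query, which is the variable the abstract identifies as a confounder when evaluating emergent abilities.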

Paper Structure (31 sections, 33 figures, 6 tables)


Figures (33)

  • Figure 1: Performance of non-instruction-tuned GPT models in the zero-shot setting. Grey background indicates tasks that are not previously identified as emergent. Tasks that require the output of a number or a coded string are evaluated using exact match accuracy. Note the consistent lack of "emergence"; see text for details.
  • Figure 2: The substantial overlap of the tasks on which the two models perform above the random baseline is noteworthy and indicates that instruction-tuning allows for the effective access of in-context capabilities rather than leading to the emergence of functional linguistic abilities. See text for details.
  • Figure 3: The figure on the left depicts prompting using ICL, where the model infers the task and the patterns based on a few examples. The figure on the right presents a few of the templates used to generate the instruction fine-tuning data on which models are fine-tuned so that they can better interpret prompts. The task depicted in these examples is Analytical entailment and the templates are from the Flan instruction fine-tuning dataset (Wei et al., 2022).
  • Figure 4: Performance of non-instruction-tuned GPT models using the adversarial prompt on the subset of tasks wherein the performance is above the random baseline. The subplot with grey background indicates that the task is not previously identified to be emergent. The performance on Codenames, Phrase relatedness, and Strange stories is predictable and so not emergent. Across the remaining tasks, the improvements in performance compared to the random baseline are relatively modest. Additionally, of the tasks on which the performance gain is slightly more notable, we find that Physical intuition is a memory-intensive task and Common morpheme has a small test set.
  • Figure 5: A comparison of the performance of Flan-T5-large (zero-shot), GPT-J (few-shot), text-davinci-001 (zero-shot), and text-davinci-003 (zero-shot) using the completion prompt. The subplots with grey background are results for tasks that are not previously identified to be emergent. Modified arithmetic is excluded from the analysis, as the task is constructed in a manner that requires the use of in-context demonstrations. The substantial overlap of the tasks on which the two models perform above the random baseline is noteworthy and indicates that instruction-tuning allows for the effective access of in-context capabilities rather than leading to the emergence of functional linguistic abilities.
  • ...and 28 more figures
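
The captions above rely on two simple measurements: exact match accuracy, and whether a model's score on a task exceeds the random baseline. The sketch below shows one way these could be computed, along with a Jaccard overlap between the sets of above-baseline tasks for two models. All function names, task names, and numbers are hypothetical placeholders rather than results from the paper, and stripping surrounding whitespace before comparison is an assumption of this sketch.

```python
from typing import Dict, Sequence, Set


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference.

    Assumption: surrounding whitespace is stripped before comparing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def tasks_above_baseline(scores: Dict[str, float], baselines: Dict[str, float]) -> Set[str]:
    """Tasks on which the score strictly exceeds the task's random baseline."""
    return {task for task, score in scores.items() if score > baselines[task]}


def jaccard_overlap(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up numbers for illustration only; not taken from the paper.
baselines = {"task_a": 0.25, "task_b": 0.50, "task_c": 0.25}
model_x = {"task_a": 0.40, "task_b": 0.50, "task_c": 0.30}
model_y = {"task_a": 0.35, "task_b": 0.60, "task_c": 0.20}

solved_x = tasks_above_baseline(model_x, baselines)
solved_y = tasks_above_baseline(model_y, baselines)
overlap = jaccard_overlap(solved_x, solved_y)
accuracy = exact_match_accuracy(["12", " 7", "x"], ["12", "7", "y"])
```

A high overlap between a few-shot base model and a zero-shot instruction-tuned model is the kind of evidence the Figure 2 and Figure 5 captions describe, although the paper's own comparison procedure may differ from this sketch.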