Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale

Hritik Bansal; Karthik Gopalakrishnan; Saket Dingliwal; Sravan Bodapati; Katrin Kirchhoff; Dan Roth

TL;DR

This work examines whether every component of a large language model is needed for in-context learning by pruning OPT-66B across 14 tasks. It uses task-specific and task-agnostic importance scores to quantify component contributions and shows that roughly 70% of attention heads and 20% of feed-forward networks (FFNs) can be removed with limited performance loss, with the important heads overlapping substantially across tasks. The study also identifies a small set of induction heads that perform prefix matching and copying; these overlap with the task-important heads, suggesting that induction heads underlie broader in-context learning capabilities. It argues that large language models may be under-trained for in-context learning and proposes directions for training objectives that strengthen induction-head formation and improve overall efficiency on instruction-tuned variants.

Abstract

Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to perform a task via in-context learning is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: $\sim$70% of attention heads and $\sim$20% of feed forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and numbers of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, reinforcing arguments by Olsson et al. (arXiv:2209.11895) regarding induction head generality to more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights that indicate large language models may be under-trained for in-context learning and opens up questions on how to pre-train language models to more effectively perform in-context learning.
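To make the prefix matching operation mentioned in the abstract more concrete, the following is a minimal sketch, not the paper's scoring procedure. It assumes a single head's causal attention matrix over a repeated token sequence and measures how much attention each position places on the token that followed an earlier occurrence of the current token. The function name `prefix_matching_score` and the toy attention matrices are hypothetical and chosen only for illustration.

```python
def prefix_matching_score(tokens, attention):
    """Mean attention mass placed on tokens that immediately followed
    an earlier occurrence of the current token (illustrative definition)."""
    total, counted = 0.0, 0
    for i, tok in enumerate(tokens):
        # Key positions j <= i whose preceding token equals the current token.
        targets = [j for j in range(1, i + 1) if tokens[j - 1] == tok]
        if not targets:
            continue  # no earlier occurrence, so nothing to match
        total += sum(attention[i][j] for j in targets)
        counted += 1
    return total / counted if counted else 0.0


# A toy sequence of three tokens repeated twice.
tokens = [5, 7, 9, 5, 7, 9]
n = len(tokens)

# Hypothetical head that spreads attention uniformly over the causal prefix.
uniform = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical idealized induction head: in the second repeat, each position
# attends only to the token that followed the earlier copy of the current token.
ideal = [[0.0] * n for _ in range(n)]
for i in range(n):
    ideal[i][i - 2 if i >= 3 else i] = 1.0

uniform_score = prefix_matching_score(tokens, uniform)
ideal_score = prefix_matching_score(tokens, ideal)
```

Under this toy definition the idealized head scores higher than the uniform one, which is the kind of separation a prefix matching score is meant to expose.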

Paper Structure (30 sections, 7 equations, 19 figures, 1 table)

Introduction
Background & Methods
Open Pre-trained Transformer (OPT)
In-Context Learning & Induction Heads
Importance Scores
Oracle
Gradient-based
Experimental Setup
Importance Scores for OPT-66B
Attention Heads
Feed Forward Networks
Iterative Pruning
Removing Attention Heads
Removing FFNs
Combined Removal of Heads & FFNs
...and 15 more sections

Figures (19)

Figure 1: A sample input prompt for in-context learning and the model output.
Figure 2: Prefix matching and copying depicted at a given time-step for a repeated sequence of tokens.
Figure 3: Attention head aggregate importance score heatmaps for in-context learning with OPT-66B.
Figure 4: Feed forward network (FFN) oracle importance scores for in-context learning with OPT-66B. Each FFN is knocked off independently to compute these scores, i.e., the curves are discrete and not cumulative.
Figure 5: Effect on in-context learning accuracy when removing attention heads in OPT-66B in an iterative manner based on task-specific and shot-specific importance scores.
...and 14 more figures
