Context-Scaling versus Task-Scaling in In-Context Learning

Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin

TL;DR

This work identifies and analyzes two key components of ICL: context-scaling, where model performance improves as the number of in-context examples increases, and task-scaling, where model performance improves as the number of pre-training tasks increases.

Abstract

Transformers exhibit In-Context Learning (ICL), where these models solve new tasks by using examples in the prompt without additional training. In our work, we identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases, and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer architecture without key, query, and value weights. We show that it performs ICL comparably to the original GPT-2 model in various statistical learning tasks, including linear regression and teacher-student settings. Furthermore, a single block of our simplified transformer can be viewed as a data-dependent feature map followed by an MLP. This feature map on its own is a powerful predictor that is capable of context-scaling but is not capable of task-scaling. We show empirically that concatenating the output of this feature map with vectorized data as an input to MLPs enables both context-scaling and task-scaling. This finding provides a simple setting to study context-scaling and task-scaling for ICL.
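To make the idea of context-scaling without any trained weights concrete, the following is a minimal sketch, not the simplified transformer described in the abstract. It assumes Gaussian inputs, noiseless linear regression tasks, and a hypothetical attention-style feature map with no key, query, or value weights and no softmax (the query is scored against each in-context input by a plain dot product, and the scores weight the in-context labels). All function names and parameter values are illustrative assumptions.

```python
import numpy as np


def sample_task(rng, n_context, dim):
    """Draw one linear regression task: y = <w, x> with w, x ~ N(0, I).

    Assumption: noiseless labels and standard Gaussian inputs.
    """
    w = rng.standard_normal(dim)
    X = rng.standard_normal((n_context, dim))  # in-context inputs
    y = X @ w                                  # in-context labels
    x_query = rng.standard_normal(dim)         # input to predict on
    return X, y, x_query, float(x_query @ w)


def weight_free_feature(X, y, x_query):
    """Hypothetical weight-free attention-style feature.

    Scores the query against every in-context input with a dot product
    (no learned key/query/value matrices, no softmax) and averages the
    labels with those scores: (1/n) * sum_i <x_query, x_i> * y_i.
    """
    scores = X @ x_query
    return float(scores @ y) / len(y)


def mean_squared_error(n_context, dim=5, n_tasks=500, seed=0):
    """Average squared error of the feature, used directly as a predictor."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_tasks):
        X, y, x_query, target = sample_task(rng, n_context, dim)
        errors.append((weight_free_feature(X, y, x_query) - target) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    # Error should shrink as the number of in-context examples grows.
    for n in (10, 100, 1000):
        print(n, mean_squared_error(n))
```

Under these assumptions the feature equals $x_{\text{query}}^\top \big(\tfrac{1}{n}\sum_i x_i x_i^\top\big) w$, which approaches $x_{\text{query}}^\top w$ as the context grows, so its error falls with more in-context examples. Because nothing is trained, it cannot benefit from additional pre-training tasks, which mirrors the abstract's distinction between the two kinds of scaling.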

Paper Structure

This paper contains 41 sections, 29 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Task-scaling and context-scaling of GPT-2 architecture transformers versus MLPs for ICL with linear regression tasks. (A) Task-scaling abilities of these models with $10$ in-context examples. (B) Context-scaling abilities of these models with $10^5$ (left) and $10^6$ (right) pre-training tasks. Experimental details are provided in Appendix A.
  • Figure 2: Linear regression with a single noise level. Left panel: Performance across varying context lengths (context-scaling). Right panel: Effect of regularization on performance for a fixed number of in-context examples. Experimental details are given in Appendix A.
  • Figure 3: Linear regression with multiple noise levels. Left and middle panels: Performance across varying context lengths (context-scaling). Right panel: Effect of regularization on performance for a fixed number of in-context examples. Experimental details are given in Appendix A.
  • Figure 4: Nonlinear ICL tasks. Context-scaling capability of SGPT versus GPT-2 architecture transformers when trained on 2 million pre-training tasks. In all cases, the errors are normalized so that the trivial zero predictor achieves an error of 1. Experimental details are given in Appendix A.
  • Figure 5: Context-scaling with one-layer SGPT. Experimental details are provided in Appendix A.
  • ...and 3 more figures