Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

TL;DR

This study explores a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability the authors term "task superposition", and provides empirical evidence of this phenomenon across various LLM families and scales.

Abstract

Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.

Paper Structure (34 sections, 7 theorems, 42 equations, 6 figures, 1 table)

This paper contains 34 sections, 7 theorems, 42 equations, 6 figures, 1 table.

Key Result

Theorem 1

A seven-layer transformer with embedding dimension ${\mathcal{O}}(d + \log(mn))$ and $K$ heads per attention layer can perform $K$ tasks on vectors of dimension $d$ in superposition, with weighting based on $m$ different in-context examples, each of length $n$.

Figures (6)

  • Figure 1: LLMs can perform task superposition. (a) Llama-3 70B and (b) GPT-3.5 Turbo are each presented with two sets of tasks. For each set of tasks, we show an example prompt such that all except the last row are in-context examples of one of the tasks and the last row is the query. We provide $20$ in-context task examples for each task in the prompt with order randomized and provide the probabilities of outputs when correctly performing each task on the query.
  • Figure 2: Box plots of the probabilities of the correct output for each task, showing the 0/25/50/75/100th percentiles with the region between the 25th and 75th percentiles colored. For every set of tasks, we tested with $100$ prompts, and for each prompt, every task has $20$ random in-context task examples with order randomized as in Figure 1. The category "other" is the sum of probabilities of all other outputs. The gray dashed line in each figure is the ideal probability if we assume the model perfectly calibrates its output distribution to the distribution of in-context task examples. With a uniform distribution of task examples, the dashed lines are at $0.25$ (4-task setting) and $0.33$ (3-task setting).
  • Figure 3: We consider two different training settings of tasks: (a) given an eight-character string as input, consider ret1, ..., ret8, where ret1 is to retrieve the first character and so on; and (b) given a two-digit integer as input, consider plus0, ..., plus9, where plus0 is to add $0$ to the input and so on. The model learns in context one task at a time during training. After training, for each setting, we select two tasks and provide the model with prompts containing in-context examples from these two tasks, varying the mixture ratio $\lambda$ such that the in-context task example distribution for the two tasks is $[\lambda, 1-\lambda]$. We plot $\lambda$ on the x-axis and the output probabilities of the task answers for each task on the y-axis.
  • Figure 4: Task vectors of Llama-3 8B projected onto two axes chosen by LDA for two sets of tasks: (a) copy(op1), copy(op2) and op1+op2, and (b) to_fr, to_de and to_it. For tasks $t_1, t_2, t_3$, we use "$\mathbb{P}(t_1)/\mathbb{P}(t_2)/\mathbb{P}(t_3)$" to denote different levels of task mixtures, e.g., "0.50/0.50/0.00" represents the case where the in-context task examples are $50\%$ $t_1$, $50\%$ $t_2$ and $0\%$ $t_3$.
  • Figure 5: On Llama-3 8B, we vary the proportion, $\lambda$, between two tasks and observe how the output probabilities for the correct answers change. The proportion $\lambda$ is varied in two ways: (1) in the top row, we plot the output from patching in a convex combination of task vectors for two tasks. (2) in the bottom row, we plot the output from a mixed proportion of in-context examples for the two tasks. Subplot (a) shows the output probabilities from mixing two copy tasks and (b) shows the probabilities from mixing two translate tasks.
  • ...and 1 more figures
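
The mixed-prompt setup described in the captions of Figures 2, 3 and 5 can be illustrated with a small sketch. It builds a prompt whose in-context examples come from two tasks in a chosen ratio, then computes the "ideal" answer distribution that the dashed lines in Figure 2 represent, i.e. the distribution a perfectly calibrated model would output. The prompt format (`input -> output`), the helper names `build_mixed_prompt` and `ideal_distribution`, and the choice of plus1 and plus5 as the two tasks are all assumptions made for illustration; the paper's actual prompt templates may differ.

```python
import random
from collections import Counter

# Hypothetical toy tasks modeled on the "plus k" setting of Figure 3(b).
TASKS = {
    "plus1": lambda x: x + 1,
    "plus5": lambda x: x + 5,
}

def build_mixed_prompt(tasks, counts, query, rng):
    """Build a prompt of shuffled in-context examples from several tasks,
    followed by an unanswered query line. Returns (prompt, task_labels)."""
    examples = []
    for name, n in counts.items():
        for _ in range(n):
            x = rng.randint(10, 89)  # two-digit input; output stays two-digit
            examples.append((name, f"{x} -> {tasks[name](x)}"))
    rng.shuffle(examples)  # randomize example order, as in the figures
    lines = [text for _, text in examples]
    lines.append(f"{query} -> ")  # the query row the model must complete
    return "\n".join(lines), [name for name, _ in examples]

def ideal_distribution(tasks, labels, query):
    """Answer probabilities if the model's output distribution matched the
    proportion of in-context examples for each task exactly."""
    total = len(labels)
    dist = Counter()
    for name, n in Counter(labels).items():
        dist[str(tasks[name](query))] += n / total
    return dict(dist)

rng = random.Random(0)
# Mixture ratio lambda = 0.75: 15 examples of plus1 and 5 of plus5.
prompt, labels = build_mixed_prompt(TASKS, {"plus1": 15, "plus5": 5}, query=42, rng=rng)
ideal = ideal_distribution(TASKS, labels, query=42)
print(prompt)
print(ideal)
```

Comparing a model's next-token probabilities for "43" and "47" on this prompt against `ideal` gives the kind of calibration check the figures report; the model call itself is omitted here because it depends on the inference API in use.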

Theorems & Definitions (14)

  • Theorem 1
  • Lemma 1
  • proof
  • Definition 1: Definition 12 in bai2023transformers
  • Definition 2: Definition A.1 in bai2023transformers
  • Proposition 1: Proposition A.1 in bai2023transformers
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 4 more