
Label Words as Local Task Vectors in In-Context Learning

Bowen Zheng, Ming Ma, Zhongqiao Lin, Tianming Yang

TL;DR

This work reframes in-context learning (ICL) in large language models: instead of a single global task vector, it proposes a distributed, demonstration-specific mechanism of local task vectors, in which the answer-position tokens of each demonstration carry critical task information. It shows that local task vectors can be patched into dummy inputs to achieve near few-shot performance, especially in categorization tasks where global vectors fail, while knowledge tasks may still benefit from a convergent global representation in deeper layers. The study uses saliency analyses and demixed PCA to show where this information is localized and how it aggregates across layers, revealing a task-dependent information-aggregation process in ICL. These findings provide a mechanistic, layer-aware account of how demonstrations guide LLM behavior and clarify when global versus local representations emerge.

Abstract

Large Language Models (LLMs) have demonstrated remarkable abilities, one of the most important being in-context learning (ICL). With ICL, LLMs can derive the underlying rule from a few demonstrations and provide answers that comply with the rule. Previous work hypothesized that the network creates a task vector in specific positions during ICL. The task vector can be computed by averaging across the dataset. It conveys the overall task information and can thus be considered global. Patching the global task vector allows LLMs to achieve zero-shot performance with dummy inputs comparable to few-shot learning. However, we find that such a global task vector does not exist in all tasks, especially in tasks that rely on rules that can only be inferred from multiple demonstrations, such as categorization tasks. Instead, the information provided by each demonstration is first transmitted to its answer position and forms a local task vector associated with the demonstration. In some tasks but not in categorization tasks, all demonstrations' local task vectors converge in later layers, forming the global task vector. We further show that local task vectors encode a high-level abstraction of rules extracted from the demonstrations. Our study provides novel insights into the mechanism underlying ICL in LLMs, demonstrating how ICL may be achieved through an information aggregation mechanism.
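The distinction between a global task vector (one averaged vector) and local task vectors (one vector per demonstration answer position) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: hidden states are plain nested lists for a single layer, the function names are hypothetical, and averaging the local vectors across prompts is one possible choice rather than the authors' confirmed procedure.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def global_task_vector(hidden, position):
    """Average the hidden state at one token position across all prompts.

    hidden[p][t] is the hidden-state vector of prompt p at position t
    for one chosen layer (hypothetical layout).
    """
    return mean_vector([prompt[position] for prompt in hidden])


def local_task_vectors(hidden, answer_positions):
    """One averaged vector per demonstration answer position."""
    return [
        mean_vector([prompt[t] for prompt in hidden])
        for t in answer_positions
    ]


def patch(states, positions, vectors):
    """Return a copy of `states` with the given positions overwritten.

    `states` is the per-position hidden state of a dummy input; the
    original list is left untouched.
    """
    patched = [list(v) for v in states]
    for t, v in zip(positions, vectors):
        patched[t] = list(v)
    return patched
```

In a real model the patched states would replace the activations of a dummy prompt at the chosen layer before the forward pass continues; the sketch only shows the bookkeeping of which vectors go where.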

Paper Structure (26 sections, 5 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Illustration of local and global task vectors. Left: In a categorization task, local task vectors contain information associated with each demonstration, but they do not converge and form a global task vector. Right: In a knowledge task, local task vectors aggregate into a coherent global task vector that aligns with the LLM’s prior knowledge, enabling effective task representation.
  • Figure 2: Accuracy increases sharply in the middle layers. The shades of blue indicate demonstration numbers. Top: knowledge task; Bottom: categorization task.
  • Figure 3: Saliency scores in layer 14, shown as heatmaps. The color at each location shows the saliency score between the positions in the respective row and column. The highest scores are observed in the last row, at the demonstrations' answer positions; this row is plotted on the right as a function of the demonstration index. (a) Knowledge task; (b) Categorization task.
  • Figure 4: Model accuracy of ablated models. The shades of blue indicate the index of layers. Top: knowledge task; Bottom: categorization task.
  • Figure 5: The encoding of question and answer information in answer positions. (a) Mahalanobis distance between the clusters defined by the demonstrations' string length (blue) and by the corresponding answer (green), in the space spanned by the two largest PCA components of the answer positions. The Mahalanobis distance for string length peaks at layer 2 and then gradually decreases across layers. (b) Task spaces of two example layers, defined by the two largest PCA components. Blue and red indicate answers 0 and 1, and the color gradient indicates string length. The segregation between the two answers (blue and red) is maintained across layers, while dots with the same color but different gradients are more mixed in later layers (e.g., layer 12) than in earlier layers (e.g., layer 2).
  • ...and 5 more figures
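
The cluster-separation measure mentioned in the Figure 5(a) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the points are assumed to be already projected onto the two largest PCA components, the two clusters are assumed to share a pooled covariance, and the function names are hypothetical.

```python
import math


def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points, mu):
    """Unnormalized scatter terms (sxx, sxy, syy) around the mean mu."""
    sxx = sum((x - mu[0]) ** 2 for x, y in points)
    sxy = sum((x - mu[0]) * (y - mu[1]) for x, y in points)
    syy = sum((y - mu[1]) ** 2 for x, y in points)
    return sxx, sxy, syy


def cluster_mahalanobis(a, b):
    """Mahalanobis distance between the means of two 2-D clusters.

    Uses the pooled covariance of both clusters (an assumption; the
    source does not state which covariance estimate is used).
    """
    mu_a, mu_b = mean2(a), mean2(b)
    sa, sb = scatter2(a, mu_a), scatter2(b, mu_b)
    dof = len(a) + len(b) - 2
    cxx, cxy, cyy = ((sa[i] + sb[i]) / dof for i in range(3))
    det = cxx * cyy - cxy * cxy
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # d^T C^{-1} d, with the 2x2 inverse written out explicitly
    q = (cyy * dx * dx - 2 * cxy * dx * dy + cxx * dy * dy) / det
    return math.sqrt(q)
```

Applied per layer, once with clusters split by answer and once with clusters split by string length, a measure of this kind would yield curves comparable in spirit to the two described in the caption.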