Task Vectors in In-Context Learning: Emergence, Formation, and Benefit

Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak

TL;DR

The paper investigates how task vectors (compact, task-specific representations) emerge in transformers trained from scratch on synthetic tasks, and how to strengthen their formation and utility. It formalizes task vectors, analyzes their natural emergence, and introduces TVP-loss, an auxiliary objective that encourages the task vector to form at a prescribed location within the model, improving robustness and generalization in in-context learning. Across linear regression, sinusoidal regression, and discrete token offset tasks, as well as formal-language benchmarks (GINC, RegBench), task vectors arise under appropriate input formats and model depths; where they localize depends on depth, and they are sensitive to context length. TVP-loss consistently sharpens the encoding and improves out-of-distribution performance. The work shows that task-vector formation can be deliberately engineered to yield stronger, more robust in-context learning representations, enabling more reliable zero-shot task execution and informing the design of task-aware prompting and compute-efficient inference.

Abstract

In-context learning is a remarkable capability of transformers, referring to their ability to adapt to specific tasks based on a short history or context. Previous research has found that task-specific information is locally encoded within models, though the emergence and functionality of these encodings remain unclear due to opaque pre-training processes. In this work, we investigate the formation of task vectors in a controlled setting, using models trained from scratch on synthetic datasets. Our findings confirm that task vectors naturally emerge under certain conditions, but the tasks may be relatively weakly and/or non-locally encoded within the model. To promote strong task vectors encoded at a prescribed location within the model, we propose an auxiliary training mechanism based on a task vector prompting loss (TVP-loss). This method eliminates the need to search for task-correlated encodings within the trained model and demonstrably improves robustness and generalization.
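
One plausible way to read the auxiliary mechanism described above is as a sum of two loss terms. The expression below is only a sketch: the weighting coefficient $\lambda$ and the symbols $\mathcal{L}_{\text{ICL}}$ and $\mathcal{L}_{\text{TVP}}$ are assumptions introduced here for illustration, and the paper's exact formulation may differ. The layer index $l$ denotes the prescribed location at which the task vector is encouraged to form.

```latex
% Sketch only: \lambda and the exact form of each term are assumptions.
\mathcal{L}(\theta)
  = \underbrace{\mathcal{L}_{\text{ICL}}(\theta)}_{\text{few-shot prediction loss}}
  + \lambda \,
    \underbrace{\mathcal{L}_{\text{TVP}}(\theta;\, l)}_{\text{zero-shot loss with layer-}l\text{ states injected}}
```

Under this reading, the first term is the usual in-context loss, and the second penalizes errors when the model must predict the target from the query alone plus the hidden states taken from layer $l$.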

Paper Structure (58 sections, 8 equations, 17 figures, 1 table)

Figures (17)

  • Figure 1: Overview of the transformer operating in in-context learning (ICL) and task vector prompting (TVP) modes. A transformer can be configured to operate in ICL mode, using input-output pairs as prompts, or in TVP mode, extracting task-specific embeddings for zero-shot predictions (architecture and training details are in Section \ref{sec:def}). On the right, the ICL and TVP performances of the vanilla-trained model and our method are shown, with the dashed horizontal line indicating random prediction performance (i.e., no task information is inferred). Compared to vanilla training, our approach enhances task-specific representations in the TVP mode while preserving comparable ICL performance.
  • Figure 2: Attention map and PCA visualization of activations across three different linear functions. (Left) For the linear regression task described in Section \ref{sec:exp_lr}, the attention map illustrates how each query (rows) attends to all available keys (columns), with each row summing to 1. The heatmap reveals that the activations at the ${\bm{x}}_i$ positions predominantly attend to the activations of the preceding $y_{i-1}$, where task information is stored (highlighted by the black boxes). Additionally, the activations at the $y_i$ positions attend to both themselves and the preceding $y_{i-1}$, enabling the online updating of task information (highlighted by the white boxes). (Right) PCA visualizations of token $y_i$ activations ($i \in \{0, 2, 4, 6, 8\}$) across layers $L$ reveal that task-specific clusters (three colors correspond to three different tasks) begin to emerge at the output of the 2nd layer, indicating that the model progressively encodes task information as depth increases.
  • Figure 3: Performance of the transformer in in-context learning (ICL) mode (solid line) and task vector prompting (TVP) mode (line with triangular markers) across problem dimensions $d \in \{4, 5, 6, 7, 8, 9\}$ and varying model depths $L$. The dashed line indicates the random-guess baseline. The results indicate that smaller problem dimensions ($d = 4$ to $d = 7$) and shallower model depths ($L = 3$ to $L = 5$) yield stronger and more stable task encoding in the TVP mode. However, task encoding remains noisy in most cases. Task vectors emerge more clearly when the ICL loss plateaus but tend to disperse with increasing in-context length. In deeper models, task information appears to distribute across layers, reducing the distinctiveness of task vectors.
  • Figure 4: Layer selection distribution (upper row) and averaged task vector prompting performance (lower row) for task vector emergence in: Linear Regression (left), Sinusoidal Regression (middle), and Discrete Token Offset (right). Upper row: we show the frequency of each layer being selected as the task vector location across varying numbers of in-context examples. For the linear regression task on a 3-layer transformer, the task vector predominantly emerges in the 2nd layer. For the more complex sinusoidal regression task on a 6-layer transformer, the task vector shifts to the 3rd layer, while for the discrete token offset task, it is primarily found in the last layer. These results suggest that task complexity influences the depth at which the task vector emerges. Lower row: we present the averaged task vector prompting (TVP) performance for each layer, measured across various context lengths. The dashed line represents random-guess performance, indicating no task information is inferred. Notably, the selected task vector layer demonstrates significantly lower loss in TVP mode compared to other layers, confirming that the emergence of the task vector is meaningful rather than due to marginal differences in loss across layers.
  • Figure 5: Demonstration of our training algorithm. In vanilla Meta-ICL training, the model is updated using the ICL-loss signal from the few-shot context. To encourage the formation of task vectors, we also explicitly include the TVP-loss from the zero-shot query. This means the model is asked to predict $y_{\text{test}}$ when only ${\bm{x}}_{\text{test}}$ and the injected hidden states are given. In the illustrated example, the transformer has 2 layers in total, and we set $l=1$ to encourage the formation of the task vector at the output of the first transformer block.
  • ...and 12 more figures
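
The layer-selection statistic described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration, assuming that the task vector layer for a given context length is simply the layer with the lowest TVP loss; the caption does not spell out the selection rule, so that criterion, the function names, and all loss values below are hypothetical.

```python
from collections import Counter


def select_task_vector_layer(tvp_loss_per_layer):
    """Return the 1-based index of the layer with the lowest TVP loss.

    Assumption: "selected as the task vector location" means argmin of
    the TVP-mode loss over layers.
    """
    best = min(range(len(tvp_loss_per_layer)), key=lambda i: tvp_loss_per_layer[i])
    return best + 1


def layer_selection_frequency(losses_by_context_length):
    """Map each selected layer to the fraction of context lengths that chose it."""
    picks = [select_task_vector_layer(losses)
             for losses in losses_by_context_length.values()]
    counts = Counter(picks)
    return {layer: n / len(picks) for layer, n in sorted(counts.items())}


# Made-up TVP losses for a 3-layer model, keyed by number of in-context
# examples. These numbers are for illustration only, not results from the paper.
example_losses = {
    5: [0.90, 0.35, 0.60],
    10: [0.88, 0.30, 0.55],
    20: [0.91, 0.52, 0.48],
    40: [0.93, 0.41, 0.62],
}

frequency = layer_selection_frequency(example_losses)
print(frequency)
```

With these invented numbers, layer 2 is selected at three of the four context lengths and layer 3 at one, which mirrors the kind of histogram the upper row of Figure 4 reports.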

Theorems & Definitions (1)

  • Remark: Task Vector Location