How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes

Harmon Bhasin; Timothy Ossowski; Yiqiao Zhong; Junjie Hu

TL;DR

The paper examines how multi-task training influences Transformer in-context capabilities by training on multiple function-class tasks using simple curriculum strategies. It introduces sequential, mixed, and random curricula and shows that a mixed curriculum yields the best data efficiency and convergence, enabling learning of harder function classes with fewer examples. Attention analysis reveals retrospective heads that consistently contribute to ICL across tasks, and masking these heads significantly harms performance, suggesting a shared mechanism for multi-task ICL. Overall, the work provides practical guidance on curriculum design to enhance ICL in Transformers and offers a foundation for extending multi-task ICL to more complex linguistic tasks and larger models.

Abstract

Large language models (LLMs) have recently shown an extraordinary ability to perform unseen tasks based on few-shot examples provided as text, also known as in-context learning (ICL). While recent works have attempted to understand the mechanisms driving ICL, few have explored training strategies that incentivize these models to generalize to multiple tasks. Multi-task learning (MTL) for generalist models is a promising direction that offers transfer learning potential, enabling large parameterized models to be trained from simpler, related tasks. In this work, we investigate the combination of MTL with ICL to build models that efficiently learn tasks while being robust to out-of-distribution examples. We propose several effective curriculum learning strategies that allow ICL models to achieve higher data efficiency and more stable convergence. Our experiments reveal that ICL models can effectively learn difficult tasks by training on progressively harder tasks while mixing in prior tasks, denoted as mixed curriculum in this work. Our code and models are available at https://github.com/harmonbhasin/curriculum_learning_icl.
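As a rough illustration of the three curricula named above, the following Python sketch picks which function class to train on at each step. The task names follow the linear, quadratic, and cubic function classes mentioned in the figure captions. The equal-length phases, the uniform sampling among unlocked tasks, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import random

# Function classes ordered from easier to harder (ordering assumed).
TASKS = ["linear", "quadratic", "cubic"]


def _phase(step, total_steps, num_tasks):
    """Index of the current training phase, assuming equal-length phases."""
    return min(step * num_tasks // total_steps, num_tasks - 1)


def sequential_task(step, total_steps, tasks=TASKS):
    """Sequential curriculum: one task at a time, in order of difficulty."""
    return tasks[_phase(step, total_steps, len(tasks))]


def mixed_task(step, total_steps, tasks=TASKS, rng=random):
    """Mixed curriculum: unlock harder tasks over time while still
    sampling previously seen tasks (uniform sampling is an assumption)."""
    unlocked = tasks[: _phase(step, total_steps, len(tasks)) + 1]
    return rng.choice(unlocked)


def random_task(step, total_steps, tasks=TASKS, rng=random):
    """Random curriculum: sample any task at every step."""
    return rng.choice(tasks)
```

Under these assumptions, a mixed schedule over 300 steps draws only linear tasks for the first 100 steps, linear or quadratic tasks for the next 100, and any of the three for the last 100, whereas the sequential schedule never revisits an earlier task.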

Paper Structure (27 sections, 7 equations, 13 figures)

Introduction
Related Work
In-context Learning
Curriculum Learning
Attention Analysis
Instruction Prompting
Methods
Problem Definition
Tasks
Function Class Learning
Curriculum Learning
Sequential Curriculum
Mixed Curriculum
Random Curriculum
Attention Analysis
...and 12 more sections

Figures (13)

Figure 1: Comparison of the moving average of all three curriculum learning strategies when evaluated on a quadratic function class dataset at test time. The mixed curriculum model is the only one that achieves a low normalized MSE. The random curriculum performs comparatively worse, whereas the sequential curriculum performs substantially worse (the y-axis is clipped so that the mixed and random curricula can be distinguished).
Figure 1: Normalized MSE over the number of in-context examples for the mixed curriculum model, the mixed curriculum model with a one-hot encoded instruction (OHEI) vector, and the mixed curriculum model with a preset instruction (PI) vector. The solid line represents the moving average (window = 10), whereas the dashed line is the true value. Scientific notation is used for the y-axis. Both of our attempts at instruction prompting are unsuccessful, as normalized MSE remains the same or worsens across all tasks.
Figure 2: Masking retrospective heads (bottom row) causes a significant increase in normalized MSE compared to masking non-retrospective heads (top row) in the mixed curriculum model.
Figure 2: Attention analysis, as described in the Attention Analysis section, for the single-task function class learning models. The linear model shows different attention patterns when evaluated on the linear and cubic test-time datasets, as it has not seen cubic examples during training. The quadratic model has no retrospective heads because it does not converge, which is evident from its normalized MSE in the supplementary figure on baseline function-learning model performance. The cubic model seems to have learned the easier tasks (e.g., linear and quadratic) from learning the harder task (cubic).
Figure 3: Comparison of the moving average of five differently seeded single-task models (blue-purple) and mixed curriculum models (orange-red) evaluated on a quadratic function class dataset at test time. The mixed curriculum models are able to learn quadratic function classes, whereas the single-task models are not, as indicated by the spikes and upward trend in their normalized MSE.
...and 8 more figures
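Several captions above plot a moving average with a window of 10. A minimal Python sketch of one way to compute such a smoothed curve follows. The trailing (rather than centered) window and the handling of the first few points are assumptions, since the captions do not specify them, and the function name is hypothetical.

```python
def moving_average(values, window=10):
    """Trailing moving average over `values`.

    For the first `window - 1` points, the average is taken over the
    values seen so far (an assumed convention).
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

Applied to a list of per-example normalized MSE values, this returns a list of the same length that can be plotted alongside the raw values.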
