Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling

Jelena Bratulić; Sudhanshu Mittal; David T. Hoffmann; Samuel Böhm; Robin Tibor Schirrmeister; Tonio Ball; Christian Rupprecht; Thomas Brox

TL;DR

The paper examines how In-Context Learning (ICL) can emerge in transformer models beyond language by studying the learning dynamics of induction heads. It shows that enforcing exact token copies in training sequences (instCopy) simplifies the look-up function and promotes the formation of the previous-token head, enabling ICL across visual datasets and EEG, settings in which ICL was previously unstable or absent. The study also shows that making the In-Weight Learning (IWL) task sufficiently challenging (via more classes, label noise, or instance discrimination) promotes ICL, highlighting a crucial interplay between ICL and IWL. Together, these findings broaden the applicability of ICL to noisy real-world modalities, enabling rapid adaptation to new visual and EEG tasks without weight updates, with implications for cross-domain generalization and real-time brain-computer interfaces.
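
To make the instCopy idea concrete, the following is a minimal sketch of how a training sequence with an exact copy of the query could be assembled. It is not the authors' data pipeline: the function name `make_sequence`, its parameters, and the use of string ids in place of image or EEG tokens are all assumptions made for illustration.

```python
import random


def make_sequence(pool, n_context=8, burst=3, inst_copy=True, rng=None):
    """Build one sequence of (item, label) context pairs plus a query.

    pool:      dict mapping class label -> list of item ids (hypothetical
               stand-ins for image or EEG tokens; at least two per class).
    burst:     number of context slots that share the query's class.
    inst_copy: if True, one of those slots holds the exact query item.
    """
    rng = rng or random.Random()
    query_label = rng.choice(sorted(pool))
    query_item = rng.choice(pool[query_label])

    # Same-class context items are distinct instances of the query class.
    others = [x for x in pool[query_label] if x != query_item]
    same = [(rng.choice(others), query_label) for _ in range(burst)]
    if inst_copy:
        # Assumption: instCopy means one context item is identical to the query.
        same[0] = (query_item, query_label)

    # Fill the remaining slots with items drawn from other classes.
    other_labels = [c for c in sorted(pool) if c != query_label]
    rest = []
    for _ in range(n_context - burst):
        c = rng.choice(other_labels)
        rest.append((rng.choice(pool[c]), c))

    context = same + rest
    rng.shuffle(context)
    return context, query_item, query_label


# Toy pool: three classes with three instances each.
pool = {0: ["a0", "a1", "a2"], 1: ["b0", "b1", "b2"], 2: ["c0", "c1", "c2"]}
context, query, label = make_sequence(pool, rng=random.Random(0))
```

With `inst_copy=True` the model can answer the query by exact matching, which is the simplified look-up described above; with `inst_copy=False` it has to rely on a learned similarity between different instances of the same class.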

Abstract

Large Language Models (LLMs) exhibit In-Context Learning (ICL), which enables the model to perform new tasks conditioning only on the examples provided in the context without updating the model's weights. While ICL offers fast adaptation across natural language tasks and domains, its emergence is less straightforward for modalities beyond text. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL for autoregressive models and various modalities by promoting the learning of the needed mechanisms for ICL. We identify exact token repetitions in the training data sequences as an important factor for ICL. Such repetitions further improve stability and reduce transiency in ICL performance. Moreover, we emphasise the significance of training task difficulty for the emergence of ICL. Finally, by applying our novel insights on ICL emergence, we unlock ICL capabilities for various visual datasets and a more challenging EEG classification task.

Paper Structure (29 sections, 18 figures, 1 table)

Introduction
Related work
Experimental setup
How to enable ICL?
Why is ICL learned and non-transient on text but not on visual data?
Why do exact copies help?
What unlocks ICL for various visual datasets?
Does in-weight learning (IWL) task influence ICL?
Enabling ICL for EEG classification
Discussion
Acknowledgments
Model details
Experiment setup details
Datasets
Promoting ICL through exact repetitions
...and 14 more sections

Figures (18)

Figure 1: A) ICL requires two operations: a similarity function and a head that attends to the previous token for knowledge aggregation; together, they form an induction head. B) A similarity function needs to be established for the previous-token heads to form. Still, the similarity function has no purpose if it cannot be associated with relevant knowledge. C) The formation of a previous-token head should be promoted by simplifying the similarity function, namely by including exact token copies in the sequence. D) Enforcing exact copies in the sequences enables ICL for noisy and complex data beyond text, such as images and EEG.
Figure 2: We train GPT-2 from scratch as a next-token predictor on sequences of image-label pairs, with control over the training sequence distribution.
Figure 3: Different training and evaluation sequences with the main difference being the number of repetitions and the use of identical copies in the context.
Figure 4: Exact copies in the context (instCopy) promote ICL performance and reduce transiency. Even a single copy is enough for ICL to emerge (bursty (low) case).
Figure 5: We observe clear induction head and ICL emergence during inference only for the model trained with burstiness and exact copies (bursty + instCopy). Attention patterns in the QK space reveal a previous-token head in layer one (diagonal with offset 1) and a query token attending to the most similar label tokens in layer two. On the right, we show average QK scores over the previous-token head positions for these models during training. High attention scores for instCopy sequences confirm previous-token head formation.
...and 13 more figures
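
The "diagonal with offset 1" pattern in the Figure 5 caption can be summarised by a single number. The sketch below is one plausible way to do that and is not taken from the paper: it averages post-softmax attention weights rather than raw QK scores, and the function name `prev_token_score` and the toy matrices are hypothetical.

```python
def prev_token_score(attn):
    """Mean attention weight each position puts on its immediate predecessor.

    attn: square list of lists where attn[i][j] is the weight from query
          position i to key position j (assumed causal, rows summing to 1).
    """
    n = len(attn)
    if n < 2:
        raise ValueError("need at least two positions")
    # Entries attn[i][i - 1] form the diagonal shifted by one position.
    return sum(attn[i][i - 1] for i in range(1, n)) / (n - 1)


# Toy comparison: an ideal previous-token head vs. uniform causal attention.
perfect = [[1.0, 0.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
uniform = [[1.0, 0.0, 0.0],
           [0.5, 0.5, 0.0],
           [1 / 3, 1 / 3, 1 / 3]]
```

A score near 1 indicates a head that attends almost exclusively to the previous token, while uniform causal attention yields a much lower value that shrinks as the sequence grows.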
