In-Context In-Context Learning with Transformer Neural Processes

Matthew Ashman; Cristiana Diaconu; Adrian Weller; Richard E. Turner

TL;DR

Standard neural processes (NPs), including transformer neural processes (TNPs), can condition on only a single dataset. The paper addresses this limitation by introducing in-context in-context learning (ICICL) and proposing the ICICL-TNP, a pseudo-token transformer that conditions on multiple in-context datasets alongside the primary context dataset. A formal KL-based guarantee (Theorem 1) states that leveraging related datasets reduces predictive uncertainty. Experiments on synthetic GP-like tasks, MNIST image completion, and ERA5 environmental data show that the ICICL-TNP recovers baseline performance when no in-context datasets are given and yields gains when related datasets are provided. The approach targets scalable, data-efficient meta-learning in settings where many related datasets share a common stochastic process, with potential applications in scientific modeling and data-driven inference.

Abstract

Neural processes (NPs) are a powerful family of meta-learning models that seek to approximate the posterior predictive map of the ground-truth stochastic process from which each dataset in a meta-dataset is sampled. There are many cases in which practitioners, besides having access to the dataset of interest, may also have access to other datasets that share similarities with it. In this case, integrating these datasets into the NP can improve predictions. We equip NPs with this functionality and describe this paradigm as in-context in-context learning. Standard NP architectures, such as the convolutional conditional NP (ConvCNP) or the family of transformer neural processes (TNPs), are not capable of in-context in-context learning, as they are only able to condition on a single dataset. We address this shortcoming by developing the in-context in-context learning pseudo-token TNP (ICICL-TNP). The ICICL-TNP builds on the family of PT-TNPs, which utilise pseudo-token-based transformer architectures to sidestep the quadratic computational complexity associated with regular transformer architectures. Importantly, the ICICL-TNP is capable of conditioning on both sets of datapoints and sets of datasets, enabling it to perform in-context in-context learning. We demonstrate the importance of in-context in-context learning and the effectiveness of the ICICL-TNP in a number of experiments.
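The complexity argument in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention here is single-head with no learned projections, and the sizes `N`, `M`, `D` and all variable names are hypothetical. It only shows why routing information through `M` pseudo-tokens needs an `M x N` score matrix where full self-attention over the context would need `N x N`.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the row maximum for numerical stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head dot-product attention without learned projections (a simplification)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ keys_values   # (n_queries, d)

rng = np.random.default_rng(0)
N, M, D = 200, 8, 16   # context points, pseudo-tokens, token width (hypothetical sizes)

context_tokens = rng.normal(size=(N, D))   # embedded context datapoints
pseudo_tokens = rng.normal(size=(M, D))    # stand-in for learned pseudo-tokens
target_token = rng.normal(size=(1, D))     # embedded target input

# Pseudo-tokens summarise the context: an M x N score matrix.
summary = cross_attention(pseudo_tokens, context_tokens)
# The target reads from the summary only: a 1 x M score matrix.
z_t = cross_attention(target_token, summary)

full_cost = N * N          # score entries for self-attention over the context
pseudo_cost = M * N + M    # score entries for the two pseudo-token steps
print(full_cost, pseudo_cost)
```

With `M` fixed, the number of attention scores grows linearly in the context size `N`, which is the property the PT-TNP family relies on.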

Paper Structure (31 sections, 2 theorems, 23 equations, 15 figures, 4 tables)

Introduction
Background
Neural Processes
Transformer Neural Processes
In-Context In-Context Learning
In-Context In-Context Learning for Mixtures of Stochastic Processes
In-Context In-Context Learning with Transformer Neural Processes
Related Work
Cross-Attention Based Architectures in NLP
Cross-Attention Based Architectures for Multi-Task Learning
Conditioning on Exchangeable Datasets in Causal ML
Experiments
Synthetic Regression
Out-of-distribution (OOD) testing
Image Completion
...and 16 more sections

Key Result

Theorem 1 (In-context in-context learning)

Let $\xi_i \sim p(\xi)$, $\mathcal{D}_i, \{\mathcal{D}_j\} \sim P(\xi_i)$. Let $p(\mathbf{y} | \mathbf{x}, \mathcal{D}_i, \xi_i)$ be the marginal posterior distribution of $P(\xi_i)$ given $\mathcal{D}_i$, $p(\mathbf{y} | \mathbf{x}, \mathcal{D}_i, \{\mathcal{D}_j\})$ be the marginal posterior distr

Figures (15)

Figure 1: A diagram illustrating the architecture with three in-context datasets. The point-wise embedding layer is used to get an initial token representation of all datapoints, including the target input location $\mathbf{x}_t$. Then for each layer of the processor, pseudo-token representations for each of the in-context datasets, $\mathbf{U}_{ic}$, and the context dataset, $\mathbf{U}$, are updated through operations. The in-context pseudo-tokens $\mathbf{U}_{ic}$ are then modulated by the context pseudo-tokens $\mathbf{U}$, followed by operations on each set of pseudo-tokens. The in-context pseudo-tokens then modulate the context pseudo-tokens, and finally the context pseudo-tokens modulate the token representation of the target input, $\mathbf{z}_t$. After $L$ layers, the processor outputs the encoder representation $e(\mathbf{x}_t, \mathcal{D}_c, \{\mathcal{D}_{ic, j}\}_{j=1}^{j=N_{ic}})$.
Figure 1: Comparison of the predictive performance (in terms of test log likelihood) between the […], […], and the […] with a varying number of in-context datasets (indicated within brackets).
Figure 2: Comparison of the predictive performance (in terms of test log likelihood) when tested OOD between the […], […], and the […] with a varying number of in-context datasets (indicated within brackets).
Figure 3: The difference between the predictive distributions when tested OOD of the regular […] and the […] when conditioning on three in-context datasets. The context datapoints come from a GP with a periodic kernel with $\ell = 6.08$.
Figure 4: A comparison between the predictive performance of the […], regular […], ICICL-[…] and […] as the proportion of context datapoints $N_c / N$ varies in the MNIST in-painting experiment.
...and 10 more figures
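The order of updates described in the architecture caption (the first Figure 1 entry above) can be sketched as a single processor layer. This is an illustrative sketch only: the caption does not name the operations here, so the code assumes cross-attention for every "modulate" step and self-attention for the per-set step, uses single-head attention with a residual connection and no learned weights, and introduces the hypothetical names `attend` and `icicl_layer`.

```python
import numpy as np

def attend(q, kv):
    """Single-head dot-product attention with a residual connection (assumed form)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return q + w @ kv

def icicl_layer(U, U_ic, Z_c, Z_ic, z_t):
    """One processor layer following the order of steps in the caption.

    U: context pseudo-tokens (M, D); U_ic: list of in-context pseudo-token sets;
    Z_c: context tokens; Z_ic: list of in-context dataset tokens; z_t: target token.
    """
    # 1. Each pseudo-token set is updated from its own dataset's tokens.
    U = attend(U, Z_c)
    U_ic = [attend(u, z) for u, z in zip(U_ic, Z_ic)]
    # 2. Context pseudo-tokens modulate the in-context pseudo-tokens,
    #    followed by an update within each set (assumed to be self-attention).
    U_ic = [attend(u, U) for u in U_ic]
    U = attend(U, U)
    U_ic = [attend(u, u) for u in U_ic]
    # 3. In-context pseudo-tokens modulate the context pseudo-tokens.
    if U_ic:  # with no in-context datasets this step is skipped
        U = attend(U, np.concatenate(U_ic, axis=0))
    # 4. Context pseudo-tokens modulate the target token.
    z_t = attend(z_t, U)
    return U, U_ic, z_t

rng = np.random.default_rng(0)
M, D, n_layers = 4, 8, 2                                  # hypothetical sizes
Z_c = rng.normal(size=(10, D))                            # context dataset tokens
Z_ic = [rng.normal(size=(n, D)) for n in (12, 7, 9)]      # three in-context datasets
U = rng.normal(size=(M, D))
U_ic = [rng.normal(size=(M, D)) for _ in Z_ic]
z_t = rng.normal(size=(1, D))

for _ in range(n_layers):
    U, U_ic, z_t = icicl_layer(U, U_ic, Z_c, Z_ic, z_t)
print(z_t.shape)
```

Because the in-context datasets reach the target only through the context pseudo-tokens, passing empty lists for `U_ic` and `Z_ic` reduces the layer to a plain pseudo-token update on the context dataset alone.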

Theorems & Definitions (3)

theorem 1: In-context in-context learning
theorem 1: In-context in-context learning
proof
