Bayes' Power for Explaining In-Context Learning Generalizations

Samuel Müller; Noah Hollmann; Frank Hutter

TL;DR

This paper argues that, in the era of large-scale single-epoch training, a more useful interpretation of neural network behavior is as an approximation of the true posterior, as defined by the data-generating process. It shows how models become robust in-context learners by effectively composing knowledge from their training data.

Abstract

Traditionally, neural network training has been viewed primarily as an approximation of maximum likelihood estimation (MLE). This interpretation originated at a time when training for multiple epochs on small datasets was common and performance was data-bound, but it falls short in the era of large-scale single-epoch training ushered in by large self-supervised setups such as language models. In this new setup, performance is compute-bound, while data is readily available. As models became more powerful, in-context learning (ICL), i.e., learning in a single forward pass based on the context, emerged as one of the dominant paradigms. In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior, as defined by the data-generating process. We demonstrate this interpretation's power for ICL and its usefulness for predicting generalizations to previously unseen tasks. We show how models become robust in-context learners by effectively composing knowledge from their training data. We illustrate this with experiments that reveal surprising generalizations, all explicable through the exact posterior. Finally, we show the inherent constraints on the generalization capabilities of posteriors and the limitations of neural networks in approximating these posteriors.
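
The posterior-approximation view can be made concrete with a standard decomposition of the cross-entropy loss. The following is a sketch in assumed notation that does not appear in the text above: a latent $\phi$ with prior $p(\phi)$, a context $D$, a query input $x$ with target $y$, and a network output $q_\theta(y \mid x, D)$. It further assumes that $y$ depends on $D$ only through $\phi$ and that $x$ carries no information about $\phi$.

```latex
% Assumed notation (not from the source): latent \phi, context D,
% query input x, target y, network output q_\theta(y \mid x, D).
\begin{align}
\mathcal{L}(\theta)
  &= \mathbb{E}_{(D, x, y) \sim p}\left[-\log q_\theta(y \mid x, D)\right] \\
  &= \mathbb{E}_{(D, x) \sim p}\left[\mathrm{KL}\left(p(\cdot \mid x, D) \,\|\, q_\theta(\cdot \mid x, D)\right)\right]
     + \mathbb{E}_{(D, x) \sim p}\left[\mathrm{H}\left(p(\cdot \mid x, D)\right)\right], \\
p(y \mid x, D)
  &= \int p(y \mid x, \phi)\, p(\phi \mid D)\, \mathrm{d}\phi .
\end{align}
```

The entropy term does not depend on $\theta$, so under these assumptions the expected loss is minimized exactly when $q_\theta(\cdot \mid x, D)$ matches the posterior predictive $p(\cdot \mid x, D)$ wherever the data-generating process places probability mass.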

Paper Structure (17 sections, 8 equations, 12 figures)

Introduction
Neural Network Training as Posterior Approximation
In-context Learning with Priors Over Finite Sets of Latents
Generalizations Explainable By Posterior Approximation
Training on Step Functions Yields Smooth Predictions But Not Everything Representable
Training on Sine Curves Can Yield Flat Line Predictions
Training on Sloped Lines and Flat Sines Teaches Predicting Sloped Sines
Limitations of the Posterior
Being representable is necessary but not sufficient
Bayesian Models With Misspecified Priors Become Exponentially Worse
Limitations to the Posterior Approximation Interpretation
Approximating an unknown distribution yields unknown outcomes
The threshold of support
Architectural Limitations
Related Work
...and 2 more sections

Figures (12)

Figure 1: The model is trained only on step functions (left), yet it learns to make smooth predictions (right), just like the true posterior for the step function prior.
Figure 2: When trained on sine curves of a single amplitude and frequency but different offsets (left), the model not only learns to model these curves (center), but also models the posterior for a sine that has a wavelength of $2$ instead of $3$. The posterior is flat, as the model is very uncertain about the offset $\Delta x$ of this curve.
Figure 3: We train a model only on two distinct classes of functions, sines and sloped lines (left). It not only learns to fit both function classes well (center), but also learns to model slightly sloped sines when prompted with data from a sloped sine.
Figure 4: The approximations of the true posterior improve with more training steps and a lower cross-entropy loss, as we expect for a powerful model, as outlined in Section \ref{sec:background}. The losses are negative because we are in a regression setting, where the density can exceed $1$.
Figure 5: While the model could make the optimal prediction (left) using a posterior that mixes just the two latents in the center, it predicts differently, as the latents in the center have a low likelihood of having generated the data.
...and 7 more figures
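
The claim in the Figure 1 caption, that a prior over step functions has a smooth posterior prediction, can be checked with an exact posterior over a finite set of latents. The following is a minimal sketch and not the authors' experimental setup. It assumes a uniform prior over a grid of step positions and Gaussian observation noise, and every name in it (`step_latents`, `posterior_weights`, `predictive_mean`, `noise_std`) is hypothetical.

```python
import numpy as np

def step_latents(thresholds):
    # Hypothetical latent family: f_t(x) = 1 if x >= t else 0.
    return [lambda x, t=t: (np.asarray(x) >= t).astype(float) for t in thresholds]

def posterior_weights(latents, x_ctx, y_ctx, noise_std=0.1):
    # Assumption: uniform prior over latents and Gaussian observation noise,
    # so the posterior is proportional to the likelihood of the context.
    log_lik = np.array([
        -0.5 * np.sum((y_ctx - f(x_ctx)) ** 2) / noise_std**2 for f in latents
    ])
    w = np.exp(log_lik - log_lik.max())  # subtract the max for numerical stability
    return w / w.sum()

def predictive_mean(latents, weights, x_query):
    # Posterior predictive mean: posterior-weighted average of all latents.
    preds = np.stack([f(x_query) for f in latents])  # shape (n_latents, n_query)
    return weights @ preds

thresholds = np.linspace(0.0, 1.0, 101)
latents = step_latents(thresholds)

# The context says the step lies somewhere between x = 0.2 and x = 0.8.
x_ctx = np.array([0.2, 0.8])
y_ctx = np.array([0.0, 1.0])

weights = posterior_weights(latents, x_ctx, y_ctx)
x_query = np.linspace(0.0, 1.0, 11)
mean = predictive_mean(latents, weights, x_query)
print(np.round(mean, 2))
```

Every individual latent jumps from 0 to 1 at a single point, but the context leaves the position of the jump uncertain. Averaging over all step positions that remain plausible gives a prediction that rises gradually between the two context points, which is the kind of smooth prediction the caption describes.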
