Bayes' Power for Explaining In-Context Learning Generalizations

Samuel Müller, Noah Hollmann, Frank Hutter

TL;DR

This paper argues that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior, as defined by the data-generating process, and shows how models become robust in-context learners by effectively composing knowledge from their training data.

Abstract

Traditionally, neural network training has been primarily viewed as an approximation of maximum likelihood estimation (MLE). This interpretation originated in a time when training for multiple epochs on small datasets was common and performance was data-bound, but it falls short in the era of large-scale single-epoch training runs ushered in by large self-supervised setups, like language models. In this new setup, performance is compute-bound, but data is readily available. As models became more powerful, in-context learning (ICL), i.e., learning in a single forward pass based on the context, emerged as one of the dominant paradigms. In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior, as defined by the data-generating process. We demonstrate this interpretation's power for ICL and its usefulness for predicting generalizations to previously unseen tasks. We show how models become robust in-context learners by effectively composing knowledge from their training data. We illustrate this with experiments that reveal surprising generalizations, all explicable through the exact posterior. Finally, we show the inherent constraints of the generalization capabilities of posteriors and the limitations of neural networks in approximating these posteriors.
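
The posterior interpretation can be made concrete with a short worked equation. The notation here is an assumption made for illustration and is not taken from the paper: $\phi$ is a latent drawn from the data-generating process, $D$ is the context, $(x, y)$ is a held-out query pair, and $q_\theta$ is the network's predictive distribution. The first line is the standard definition of the posterior predictive distribution. The second is the standard decomposition of expected cross-entropy, which suggests why minimizing the loss on fresh samples would push $q_\theta$ toward that posterior predictive.

```latex
% Posterior predictive distribution under the data-generating process
p(y \mid x, D) = \int p(y \mid x, \phi)\, p(\phi \mid D)\, d\phi

% Expected cross-entropy = expected KL to the posterior predictive + a term independent of \theta
\mathbb{E}_{D, x, y}\!\left[ -\log q_\theta(y \mid x, D) \right]
  = \mathbb{E}_{D, x}\!\left[ \mathrm{KL}\!\left( p(\cdot \mid x, D) \,\|\, q_\theta(\cdot \mid x, D) \right) \right]
  + \mathbb{E}_{D, x}\!\left[ \mathrm{H}\!\left( p(\cdot \mid x, D) \right) \right]
```

Since the entropy term does not depend on $\theta$, the loss is minimized exactly when the KL term vanishes, that is, when $q_\theta$ matches the posterior predictive wherever the data-generating process places mass.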

Paper Structure (17 sections, 8 equations, 12 figures)

Figures (12)

  • Figure 1: The model is trained only on step functions (left), yet it learns to make smooth predictions (right), just like the true posterior for the step function prior.
  • Figure 2: When trained on sine curves of a single amplitude and frequency but different offsets (left), the model not only learns to model these curves (center), but also models the posterior for a sine that has a wavelength of $2$ instead of $3$. The posterior is flat, as the model is very uncertain about the offset $\Delta x$ of this curve.
  • Figure 3: We train a model on only two distinct classes of functions, sines and sloped lines (left). It not only learns to fit both function classes well (center), but also learns to model slightly sloped sines when prompted with data from a sloped sine.
  • Figure 4: We see that the approximations of the true posterior become better with more training steps and a lower cross-entropy loss, as we expect for a powerful model, as outlined in Section \ref{sec:background}. The losses are negative, as we are in a regression setting, where the density can be above $1$.
  • Figure 5: While the model could make the optimal prediction (left) using a posterior mixing just the two latents in the center, it predicts differently, as the latents in the center have a low likelihood of having generated the data.
  • ...and 7 more figures
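
The observation in Figure 1, that a prior containing only step functions yields smooth predictions, can be reproduced with a minimal sketch of an exact posterior. Everything in it is an assumption chosen for illustration and may differ from the paper's setup: the prior is a uniform grid of step positions (`thresholds`), the step heights are fixed at 0 and 1, the observation noise is Gaussian with `noise_std`, and all function names are hypothetical.

```python
import numpy as np


def step(x, t, low=0.0, high=1.0):
    """Step function: `low` left of threshold t, `high` from t onwards."""
    return np.where(x < t, low, high)


def posterior_over_thresholds(x_ctx, y_ctx, thresholds, noise_std=0.1):
    """Exact posterior over a discrete, uniform prior on step positions."""
    # Predictions of every candidate step function at the context inputs: (T, N)
    preds = step(x_ctx[None, :], thresholds[:, None])
    # Gaussian log-likelihood up to a constant shared by all candidates
    log_lik = -0.5 * np.sum(((y_ctx[None, :] - preds) / noise_std) ** 2, axis=1)
    log_lik -= log_lik.max()  # stabilise the exponentiation
    weights = np.exp(log_lik)
    return weights / weights.sum()


def posterior_predictive_mean(x_query, x_ctx, y_ctx, thresholds, noise_std=0.1):
    """Posterior-weighted average of all step functions at the query inputs."""
    weights = posterior_over_thresholds(x_ctx, y_ctx, thresholds, noise_std)
    preds = step(x_query[None, :], thresholds[:, None])  # (T, Q)
    return weights @ preds


# Assumed toy context: one low observation on the left, one high on the right.
thresholds = np.linspace(0.0, 1.0, 201)
x_ctx = np.array([0.2, 0.8])
y_ctx = np.array([0.0, 1.0])
x_query = np.linspace(0.0, 1.0, 11)

mean = posterior_predictive_mean(x_query, x_ctx, y_ctx, thresholds)
print(np.round(mean, 3))
```

No single latent in this prior is a ramp, yet the posterior mean rises gradually between the two observations, because the context leaves the step position uncertain and the prediction averages over every position that is consistent with the data.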