In-Context Learning of Energy Functions

Rylan Schaeffer, Mikail Khona, Sanmi Koyejo

TL;DR

The paper generalizes in-context learning by representing the in-context distribution as a Boltzmann form $p_{\theta}^{ICL}({\bm{x}}|\mathcal{D}) = \frac{\exp(-E_{\theta}^{ICL}({\bm{x}}|\mathcal{D}))}{Z_{\theta}}$, allowing an unconstrained energy function $E_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$ to model arbitrary conditionals. It trains a causal transformer to adapt $E_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$ based on in-context data, using contrastive divergence to handle the partition function and derive a gradient involving both real and confabulated samples. Sampling from the conditional uses Langevin dynamics ${\bm{x}}_{t+1}^{-} \leftarrow {\bm{x}}_{t}^{-} - \alpha \nabla_{{\bm{x}}} E_{\theta}^{ICL}({\bm{x}}_{t}^{-}|\mathcal{D}) + {\omega}_{t}$. Preliminary experiments on synthetic mixture-of-Gaussians with a causal-transformer energy model show adaptive, sharpened energy landscapes as in-context data accumulate, marking the first instance of ICL where input and output spaces differ and suggesting a broader applicability of energy-based ICL.
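
To make the Langevin update concrete, here is a minimal NumPy sketch. It does not use the paper's transformer: the energy is a hypothetical stand-in (the negative log of a Gaussian kernel density over the in-context points), chosen only because its gradient is available in closed form. The kernel width `SIGMA`, the step size, the step count, and the noise scale $\sqrt{2\alpha}$ for $\omega_t$ are all assumptions; the summary above does not specify the distribution of $\omega_t$.

```python
import numpy as np

SIGMA = 0.5  # kernel width of the toy energy (assumption, not from the paper)

def energy(x, data):
    """Toy E(x | D): negative log of a Gaussian kernel density built on D."""
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / (2 * SIGMA**2)
    top = logits.max(axis=1, keepdims=True)  # subtract the max for stability
    return -(top[:, 0] + np.log(np.exp(logits - top).mean(axis=1)))

def energy_grad(x, data):
    """Analytic gradient of the toy energy with respect to x."""
    diff = x[:, None, :] - data[None, :, :]
    logits = -(diff**2).sum(axis=-1) / (2 * SIGMA**2)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over data points
    return (weights[:, :, None] * diff).sum(axis=1) / SIGMA**2

def langevin_sample(x0, data, step_size=0.01, n_steps=200, rng=None):
    """Iterate x <- x - alpha * grad_x E(x | D) + omega, omega ~ N(0, 2 * alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.normal(scale=np.sqrt(2 * step_size), size=x.shape)
        x = x - step_size * energy_grad(x, data) + noise
    return x

# In-context dataset D: draws from a two-component mixture of Gaussians.
rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
data = centers[rng.integers(0, 2, size=64)] + 0.3 * rng.normal(size=(64, 2))

x0 = rng.normal(scale=4.0, size=(256, 2))  # chains start from broad noise
samples = langevin_sample(x0, data, rng=rng)
print(energy(x0, data).mean(), energy(samples, data).mean())
```

The chains start far from the data and should end in low-energy regions near the in-context points. In the paper's setting, `energy_grad` would instead come from differentiating the transformer's scalar output with respect to ${\bm{x}}$.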

Abstract

In-context learning is a powerful capability of certain machine learning models that arguably underpins the success of today's frontier AI models. However, in-context learning is critically limited to settings where the in-context distribution of interest $p_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$ can be straightforwardly expressed and/or parameterized by the model; for instance, language modeling relies on expressing the next-token distribution as a categorical distribution parameterized by the network's output logits. In this work, we present a more general form of in-context learning without such a limitation that we call in-context learning of energy functions. The idea is to instead learn the unconstrained and arbitrary in-context energy function $E_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$ corresponding to the in-context distribution $p_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$. To do this, we use classic ideas from energy-based modeling. We provide preliminary evidence that our method empirically works on synthetic data. Interestingly, our work contributes (to the best of our knowledge) the first example of in-context learning where the input space and output space differ from one another, suggesting that in-context learning is a more general capability than previously realized.
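
For readers unfamiliar with the energy-based modeling ideas mentioned here, the standard maximum-likelihood gradient for a Boltzmann-form model can be written out as follows. This is the textbook identity, not a formula copied from the paper; the explicit dependence of the partition function on $\mathcal{D}$ and the symbols ${\bm{x}}^{+}$ (real sample) and ${\bm{x}}^{-}$ (model sample) are notational assumptions.

$$-\log p_{\theta}^{ICL}({\bm{x}}|\mathcal{D}) = E_{\theta}^{ICL}({\bm{x}}|\mathcal{D}) + \log Z_{\theta}(\mathcal{D}), \qquad Z_{\theta}(\mathcal{D}) = \int \exp\left(-E_{\theta}^{ICL}({\bm{x}}'|\mathcal{D})\right) d{\bm{x}}'$$

$$\nabla_{\theta} \log Z_{\theta}(\mathcal{D}) = -\,\mathbb{E}_{{\bm{x}}^{-} \sim p_{\theta}^{ICL}(\cdot|\mathcal{D})}\left[\nabla_{\theta} E_{\theta}^{ICL}({\bm{x}}^{-}|\mathcal{D})\right]$$

$$\nabla_{\theta}\, \mathbb{E}_{{\bm{x}}^{+}}\left[-\log p_{\theta}^{ICL}({\bm{x}}^{+}|\mathcal{D})\right] = \mathbb{E}_{{\bm{x}}^{+}}\left[\nabla_{\theta} E_{\theta}^{ICL}({\bm{x}}^{+}|\mathcal{D})\right] - \mathbb{E}_{{\bm{x}}^{-}}\left[\nabla_{\theta} E_{\theta}^{ICL}({\bm{x}}^{-}|\mathcal{D})\right]$$

The intractable partition function never has to be evaluated: only samples ${\bm{x}}^{-}$ from the model are needed, and contrastive divergence approximates them with a short sampling chain.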

Paper Structure (7 sections, 8 equations, 2 figures)

Figures (2)

  • Figure 1: In-Context Learning of Energy Functions. Transformers learn to compute energy functions $E_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$ corresponding to probability distributions $p^{ICL}({\bm{x}}|\mathcal{D})$, where $\mathcal{D}$ are in-context datasets that vary during pretraining. At inference time, when conditioned on a new in-context dataset, the transformer computes a new energy function using fixed network parameters $\theta$. The transformers' energy landscapes progressively sharpen as additional in-context training data are conditioned upon (left to right). Bottom. The energy function $E_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$ can be used to compute a gradient with respect to ${\bm{x}}$ that enables sampling higher probability points, without requiring a restricted parametric form for the corresponding conditional probability distribution $p_{\theta}^{ICL}({\bm{x}}|\mathcal{D})$.
  • Figure 2: Pseudocode for Training In-Context Learning of Energy Functions.
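
The pseudocode of Figure 2 is not reproduced in this summary, so the following is not the authors' algorithm. It is only a toy sketch of what a contrastive-divergence training loop over varying in-context datasets can look like. Everything in it is a hypothetical simplification: the transformer is replaced by a one-parameter energy $E_{\theta}(x|\mathcal{D}) = \tfrac{1}{2}\theta\,(x - \mathrm{mean}(\mathcal{D}))^2$, the data are one-dimensional Gaussians with a shared spread `TRUE_STD`, and the learning rate, step size, and chain length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_STD = 0.5  # spread shared by every toy in-context dataset (assumption)

def grad_x(x, context, theta):
    """d/dx of the toy energy E_theta(x | D) = 0.5 * theta * (x - mean(D))**2."""
    return theta * (x - context.mean())

def grad_theta(x, context):
    """d/dtheta of the same toy energy, evaluated per sample."""
    return 0.5 * (x - context.mean()) ** 2

def langevin(x, context, theta, step=0.02, n_steps=100):
    """Short Langevin chain that produces negative (model) samples."""
    for _ in range(n_steps):
        noise = rng.normal(scale=np.sqrt(2 * step), size=x.shape)
        x = x - step * grad_x(x, context, theta) + noise
    return x

theta, lr = 1.0, 0.5
for _ in range(300):
    # A fresh in-context dataset D with a new mean on every iteration.
    mu = rng.uniform(-3.0, 3.0)
    context = mu + TRUE_STD * rng.normal(size=16)
    # Positive samples: real data from the same distribution as D.
    x_pos = mu + TRUE_STD * rng.normal(size=64)
    # Negative samples: Langevin chains started from broad uniform noise.
    x_neg = langevin(rng.uniform(-4.0, 4.0, size=64), context, theta)
    # Contrastive-divergence gradient: real minus model energy gradients.
    cd_grad = grad_theta(x_pos, context).mean() - grad_theta(x_neg, context).mean()
    theta -= lr * cd_grad

print(theta)
```

Because the parameter is shared across datasets while the mean comes from the context, the learned `theta` should drift toward roughly the inverse variance of the data, which is the toy analogue of fixed network parameters producing a new energy function for each new in-context dataset.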