Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman, Hidenori Tanaka, Ekdeep Singh Lubana

TL;DR

The paper develops a unified Bayesian framework that treats both in-context learning (ICL) and activation steering as belief updates over latent concepts in large language models. Decomposing posterior odds into a prior and a likelihood term, and modeling evidence accumulation as sub-linear in context size via power-law scaling, yields a closed-form model, $\log o(c|x) = a \cdot m + b + \gamma N^{1-\alpha}$, where $m$ is the steering magnitude and $N$ the number of in-context examples. From this model the paper derives a phase boundary $N^*(m)$ that predicts when behavior shifts from concept $c'$ to $c$. The model is highly predictive across multiple models and persona-domain datasets. It explains prior findings (sigmoidal ICL curves) and predicts novel phenomena: ICL and steering combine additively in log-belief space, and behavior shifts suddenly at phase boundaries. These results offer a principled way to predict, combine, and safely control LLM behavior at inference time, with implications for the design of prompts and activation-level interventions and for understanding neural representations through a Bayesian lens.
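To make the closed-form model concrete, here is a minimal Python sketch. It rests on assumptions that the summary does not confirm: $m$ is read as steering magnitude and $N$ as the number of in-context examples (following the figure captions), the phase boundary $N^*(m)$ is taken to be the point where the log posterior odds cross zero, and $\gamma > 0$, $\alpha < 1$. The function names and all parameter values are hypothetical, chosen only for illustration and not fitted to any data.

```python
import math


def log_posterior_odds(m, n, a, b, gamma, alpha):
    """log o(c|x) = a*m + b + gamma * N^(1 - alpha), as given in the TL;DR."""
    return a * m + b + gamma * n ** (1.0 - alpha)


def p_target(m, n, a, b, gamma, alpha):
    """Belief in the target concept c: the logistic sigmoid of the log odds.

    Assumes c and c' are the only two competing concepts.
    """
    z = log_posterior_odds(m, n, a, b, gamma, alpha)
    return 1.0 / (1.0 + math.exp(-z))


def phase_boundary(m, a, b, gamma, alpha):
    """N*(m): context size at which the log odds cross zero.

    Assumption (not stated in the summary): the boundary is where both
    concepts are equally believed. Requires gamma > 0 and alpha < 1.
    """
    deficit = -(a * m + b)  # evidence still needed to overcome the prior
    if deficit <= 0:
        return 0.0  # the (steered) prior already favors c
    return (deficit / gamma) ** (1.0 / (1.0 - alpha))


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    params = dict(a=2.0, b=-4.0, gamma=1.0, alpha=0.5)
    for m in (0.0, 1.0, 3.0):
        n_star = phase_boundary(m, **params)
        print(f"m={m:.1f}  N*={n_star:.1f}")
    print(p_target(0.0, 16, **params))  # exactly at the boundary for m=0
```

With these illustrative values, stronger steering moves the boundary to smaller context sizes ($N^* = 16$ at $m = 0$, $N^* = 4$ at $m = 1$), which is one way to read the claim that steering shifts the ICL learning curve.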

Abstract

Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena - e.g., sigmoidal learning curves as in-context evidence accumulates - while predicting novel ones - e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls. Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
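The additivity claim follows from writing Bayes' rule in odds form. The derivation below is a sketch under two assumptions of ours: only two concepts compete (a default $c'$ and a target $c$), and the terms map onto the TL;DR's closed form in the way the figure captions suggest (steering acts on the log prior odds, in-context examples on the log likelihood ratio).

```latex
\begin{align}
\log \frac{p(c \mid x)}{p(c' \mid x)}
  &= \underbrace{\log \frac{p(c)}{p(c')}}_{\text{log prior odds (steering)}}
   + \underbrace{\log \frac{p(x \mid c)}{p(x \mid c')}}_{\text{log likelihood ratio (ICL)}} \\
p(c \mid x)
  &= \sigma\!\left( \log \frac{p(c \mid x)}{p(c' \mid x)} \right),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\end{align}
```

The first line is exact. The second holds when $p(c \mid x) + p(c' \mid x) = 1$. Because the two interventions enter as a sum inside a sigmoid, a small change in either one near the point where the sum crosses zero produces a large change in $p(c \mid x)$, which is consistent with the sudden behavioral shifts described above.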

Paper Structure

This paper contains 29 sections, 25 equations, 15 figures.

Figures (15)

  • Figure 1: Overview of our unified Bayesian theory of in-context learning and activation steering. We argue that in-context learning (ICL) and activation steering both impact behavior by updating an LLM's belief in latent concepts. We empirically test our claims in five domains of manipulating language model "persona" (bottom left) and predict that ICL will follow a sudden learning curve with increasing context length, and that this curve will be shifted under activation steering (top left). By our account, ICL with increasing context length $|x|$ and steering vectors with increasing magnitude both operate by updating an LLM's belief in latent concepts $c$.
  • Figure 2: Replication of many-shot ICL results in persona domains (Anil et al., 2024).
  • Figure 3: Belief updating with concept vectors. (Left) From a representational perspective, we assume that the default behavior of an LLM (e.g. Neutral Persona $c'$) and the target behavior (Target Persona $c$) correspond to concept vectors. In-context learning (blue) directs the initial belief state from $c'$ to increasingly point towards $c$ as a function of the log number of shots $|x|$. Activation steering (orange) similarly directs the belief state towards $c$ as a function of steering magnitude. (Right) We offer a parallel Bayesian perspective that in-context learning ($x_k$) and activation steering ($v$) both operate by changing an LLM's belief in latent concepts $c$. In our theory, in-context learning updates the posterior belief through the likelihood function $p(x | c)$ (where $p(c|x) \propto p(x | c)$) and activation steering intervenes on concept priors $p(c) \rightarrow p'(c)$.
  • Figure 4: In-context learning dynamics are sigmoidal with respect to $N^{1-\alpha}$ and modulated by activation steering. We find sigmoidal many-shot in-context learning dynamics (solid lines) which can be effectively fit with a power law of scaling in-context data (dotted line). We additionally find that activation steering with different magnitudes (line colors) shifts in-context learning dynamics. In our belief dynamics model, this is explained by activation steering altering the LLM's belief state. Model predictions represent held-out predictions from cross-validation. Note that, since we fit our models via cross-validation, we use the average $\alpha$ fit across folds to transform the x-axis in this figure.
  • Figure 5: Change in behavior as a function of steering vector magnitude. As we scale steering vector magnitude (x-axis), we find a sigmoidal response function in behavior (y-axis). With steering magnitudes in the range $[-1, 1]$, we find approximately linear effects of steering, which taper off as magnitude increases. This pattern holds across different numbers of ICL examples (different colors). This pattern is well-captured by our model, which assumes a linear impact of steering on the log prior odds, and hence a sigmoidal impact in probability space.
  • ...and 10 more figures