Monitoring Latent World States in Language Models with Propositional Probes

Jiahai Feng, Stuart Russell, Jacob Steinhardt

TL;DR

This work investigates whether language models encode a latent, symbolic world model in their internal activations. It introduces propositional probes that use domain probes and a Hessian-derived binding subspace to extract propositions binding names to attributes, preserving compositional structure. The authors show decoded propositions stay faithful in several adversarial settings (prompt injections, backdoors, gender bias) even when the model's outputs do not, suggesting a robust latent world state. The results motivate interpretable monitoring tools for inference-time governance and point to future work on scaling binding mechanisms and role-filler representations.

Abstract

Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with "propositional probes", which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context "Greg is a nurse. Laura is a physicist.", we decode the propositions "WorksAs(Greg, nurse)" and "WorksAs(Laura, physicist)" from the model's activations. Key to this is identifying a "binding subspace" in which bound tokens have high similarity ("Greg" and "nurse") but unbound ones do not ("Greg" and "physicist"). We validate propositional probes in a closed-world setting with finitely many predicates and properties. Despite being trained on simple templated contexts, propositional probes generalize to contexts rewritten as short stories and translated to Spanish. Moreover, we find that in three settings where language models respond unfaithfully to the input context -- prompt injections, backdoor attacks, and gender bias -- the decoded propositions remain faithful. This suggests that language models often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.
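
To make the composition step in the abstract concrete, here is a minimal sketch of how bound tokens might be paired once a binding subspace is known. It is not the authors' implementation: the toy activations, the matrix `H`, and the helper names `binding_similarity` and `compose_propositions` are all assumptions introduced for illustration. Each toy activation is a random "lexical" part plus a "binding" part, and the similarity is taken to be the bilinear form x^T H y.

```python
import numpy as np

def binding_similarity(x, y, H):
    """Bilinear similarity x^T H y between two activation vectors (assumed form)."""
    return float(x @ H @ y)

def compose_propositions(names, attrs, H, predicate="WorksAs"):
    """Pair each attribute with the name whose activation is most similar under H."""
    props = []
    for attr_label, attr_vec in attrs.items():
        best = max(names, key=lambda n: binding_similarity(names[n], attr_vec, H))
        props.append(f"{predicate}({best}, {attr_label})")
    return props

rng = np.random.default_rng(0)
dim = 16

# Toy binding vectors: two orthogonal directions, one per bound pair.
b0, b1 = np.eye(dim)[0], np.eye(dim)[1]
# Hypothetical binding matrix: projects onto the two binding directions.
H = np.outer(b0, b0) + np.outer(b1, b1)

def lexical():
    """Random lexical component with no mass in the binding directions."""
    v = rng.normal(size=dim)
    v[:2] = 0.0
    return v

# Stand-ins for activations that domain probes have already labeled.
names = {"Greg": lexical() + b0, "Laura": lexical() + b1}
attrs = {"nurse": lexical() + b0, "physicist": lexical() + b1}

propositions = compose_propositions(names, attrs, H)
print(propositions)
```

In this toy setup the lexical parts are invisible to `H`, so the pairing depends only on the binding components, which is the property the abstract attributes to the binding subspace.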

Paper Structure

This paper contains 29 sections, 6 equations, 19 figures, 2 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Left: Name (blue) and country probes (green) classify activations into either a name/country or a null value. Right: Activations have a lexical component (e.g. $f_E(\text{"Alice"})$) and a binding component (e.g. $b_E(0)$), such that bound activations have similar binding components (e.g. $b_E(0)$ and $b_A(0)$). We use this to compose across tokens.
  • Figure 2: To create our datasets, we first generate sets of random propositions about two people. Each set is formatted with a template (synth), rewritten into a story (para), and translated into Spanish (trans). We train probes to predict propositions from the easy synth dataset, and test probes on the hard para and trans datasets.
  • Figure 3: Overview of the Hessian-based algorithm. 1) Activations for "Alice" and "Laos" are bound because their binding vectors (horizontal) align under the binding matrix ${\bm{H}}$; likewise for "Bob" and "Peru". 2) Ablate binding information by setting binding vectors to their midpoints. 3) Perturb activations with $\pm x$ and $\pm y$; binding is recovered in the figure because $x$ and $y$ are aligned. 4a, 4b) To compute binding strength $F(x,y)$, append query strings and measure the probability of the correct next token.
  • Figure 4: The accuracy of swapping binding information in name (attribute) activations by projecting into ${\bm{U}}_{(k)}$ (${\bm{V}}_{(k)}$) against $k$ in a context with 3 names and 3 attributes. We test the subspaces from the Hessian (blue), a random baseline (orange), and a skyline subspace obtained by estimating the subspace spanned by the first 3 binding vectors. We perform all 3 pairwise switches: 0-1 represents swapping the binding information of $E_0$ and $E_1$ ($A_0$ and $A_1$), and so on.
  • Figure 5: Similarity between token activations under the binding similarity metric $d(\cdot, \cdot)$ for two-entity serial (left) and parallel (middle) contexts. Right: Three-entity serial context.
  • ...and 14 more figures
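
The captions for Figures 3 and 4 refer to a binding matrix obtained from the Hessian of a binding strength $F(x,y)$ and to subspaces ${\bm{U}}_{(k)}$ and ${\bm{V}}_{(k)}$. The sketch below illustrates that idea on a toy problem only. The bilinear stand-in for `F`, the central finite-difference estimator, and the use of an SVD to pick the top-$k$ directions are assumptions made for illustration, not a description of the paper's actual procedure, in which $F$ is measured from the model's next-token probabilities.

```python
import numpy as np

def mixed_hessian(F, dim, eps=1e-2):
    """Estimate H[i, j] = d^2 F / (dx_i dy_j) at the origin by central differences."""
    H = np.zeros((dim, dim))
    I = np.eye(dim)
    for i in range(dim):
        for j in range(dim):
            x, y = eps * I[i], eps * I[j]
            # Symmetric +/- perturbations of both arguments.
            H[i, j] = (F(x, y) - F(x, -y) - F(-x, y) + F(-x, -y)) / (4 * eps**2)
    return H

rng = np.random.default_rng(0)
dim, rank = 8, 2

# Hypothetical low-rank ground-truth binding matrix for the toy problem.
A = rng.normal(size=(dim, rank))
B = rng.normal(size=(dim, rank))
H_true = A @ B.T

def F(x, y):
    """Toy binding strength: a bilinear function of the two perturbations."""
    return float(x @ H_true @ y)

H_est = mixed_hessian(F, dim)

# Top-k left and right singular vectors as candidate binding subspaces.
U, S, Vt = np.linalg.svd(H_est)
k = rank
U_k, V_k = U[:, :k], Vt[:k].T

print(np.allclose(H_est, H_true, atol=1e-6))
print(S.round(3))
```

Because the toy `F` is exactly bilinear, the finite-difference estimate recovers `H_true` up to floating-point error and only the first `k` singular values are non-negligible. With a real model, `F` would be noisy and nonlinear, so the estimate and the choice of `k` would need empirical validation of the kind Figure 4 reports.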