Monotonic Representation of Numeric Properties in Language Models

Benjamin Heinzerling, Kentaro Inui

TL;DR

This work investigates how language models encode numeric properties such as birth years by identifying low-dimensional, monotonic subspaces that correlate with expressed quantities. It uses partial least squares regression to find property-encoding directions from entity prompts and activation representations, and then tests causality by activation patching along these directions, observing monotonic changes in outputs with notable side effects. Across multiple models and six numeric properties, the authors show that most numeric attributes are predictable from 2–6 dimensional subspaces, and that perturbations along these directions causally shift the model outputs in a monotonic fashion. The findings suggest that monotonic representations of numeric properties emerge during pretraining and provide a framework for interpretable and controllable interventions in LM behavior, with implications for interpretability and alignment.
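The subspace-finding step can be illustrated with a minimal sketch. It assumes synthetic "entity representations" with a single planted direction, and it implements single-target PLS with the textbook NIPALS algorithm in NumPy; the dimensions, noise level, and the names `pls1_fit`, `predict`, and `r2` are illustrative assumptions, not the authors' code or data.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-target PLS (PLS1) with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # scores of all samples along w
        tt = t @ t
        p = Xk.T @ t / tt            # X loadings
        c = (yk @ t) / tt            # y loading
        Xk = Xk - np.outer(t, p)     # deflate X and y before the next component
        yk = yk - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in input space
    return x_mean, y_mean, coef, W

def predict(X, x_mean, y_mean, coef):
    return (X - x_mean) @ coef + y_mean

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for activations: 32-d vectors with one planted direction u
# (all sizes and scales here are assumptions for illustration).
rng = np.random.default_rng(0)
n, d = 2500, 32
u = rng.normal(size=d)
u /= np.linalg.norm(u)
X = rng.normal(size=(n, d))
y = 1900.0 + 30.0 * (X @ u) + rng.normal(scale=5.0, size=n)

X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

scores = {}
for k in (1, 2, 4):
    x_mean, y_mean, coef, W = pls1_fit(X_train, y_train, k)
    scores[k] = r2(y_test, predict(X_test, x_mean, y_mean, coef))
    print(f"{k} PLS component(s): held-out R^2 = {scores[k]:.3f}")

# Alignment of the first PLS weight vector with the planted direction.
first_direction = pls1_fit(X_train, y_train, 1)[3][:, 0]
alignment = float(first_direction @ u)
```

In this toy setting a single component already recovers the planted direction; real activations would be expected to need more components, consistent with the 2–6 dimensions reported above.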

Abstract

Language models (LMs) can express factual knowledge involving numeric properties, such as "Karl Popper was born in 1902." However, how this information is encoded in the model's internal representations is not well understood. Here, we introduce a simple method for finding and editing representations of numeric properties such as an entity's birth year. Empirically, we find low-dimensional subspaces that encode numeric properties monotonically, in an interpretable and editable fashion. When editing representations along directions in these subspaces, LM output changes accordingly. For example, by patching activations along a "birthyear" direction we can make the LM express an increasingly late birth year: "Karl Popper was born in 1929," "Karl Popper was born in 1957," "Karl Popper was born in 1968." Property-encoding directions exist across several numeric properties in all models under consideration, suggesting that monotonic representation of numeric properties may consistently emerge during LM pretraining. Code: https://github.com/bheinzerling/numeric-property-repr
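The editing step described in the abstract can be sketched as adding a scaled unit direction to a hidden state. The example below is a toy under stated assumptions: `toy_readout` is a hypothetical monotonic (but nonlinear) stand-in for the LM's decoding of a birth year, and `direction`, `h`, and the edit weights are randomly generated placeholders rather than real model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical unit-norm property direction (in practice it would come from a
# fitted regression on entity representations).
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def patch(h, direction, alpha):
    """Shift hidden state h by edit weight alpha along the given direction."""
    return h + alpha * direction

def toy_readout(h):
    """Hypothetical stand-in for the LM: maps the projection onto `direction`
    to a year through a monotonic, nonlinear function."""
    return 1950.0 + 60.0 * np.tanh(h @ direction)

h = rng.normal(size=d)               # placeholder entity representation
alphas = np.linspace(-2.0, 2.0, 9)   # edit weights, chosen arbitrarily
years = np.array([toy_readout(patch(h, direction, a)) for a in alphas])

for a, year in zip(alphas, years):
    print(f"alpha = {a:+.1f} -> expressed year ~ {year:.1f}")
```

Because the readout here is monotonic in the projection, larger edit weights always yield later years, even though the change is not linear in the edit weight.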

Paper Structure (23 sections, 5 figures, 2 tables)

This paper contains 23 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Sketch of our main finding. Patching entity representations along specific directions in activation space yields corresponding changes in model output.
  • Figure 2: Low-dimensional subspaces of Llama-2-13B's 5120-dimensional activation space are predictive of the quantity expressed by the LM when queried for a numeric attribute of an entity, across six different numeric properties. Each subfigure shows the performance of a regression model fitted to predict the expressed quantities from LM-internal entity representations (in layer $l=0.3$), as a function of the number of PCA/PLS components used for prediction. Unlike regression on PCA components (dashed orange), partial least squares regression (PLS, solid blue) identifies a small set of predictive components. Controls with shuffled labels (dotted green, dash-dotted red) and random entity representations (long-dash-dot purple, dash-dot-dot brown) fail to find predictive subspaces.
  • Figure 3: Projection onto the top two components of per-property partial least squares regressions reveals monotonic structure in LM representations. We first fit a PLS model on Llama 2 13B entity representations from our training split for each property, project entity representations from the test split, and then plot the resulting 2-d projections. Each dot represents one entity and color saturation represents the value of the corresponding entity attribute. See units for each property in Table \ref{tbl:data_sample_small}.
  • Figure 4: Effect of activation patching along property-specific directions across several numeric properties. Each subplot shows the change in the numeric attribute value expressed by Llama 2 13B, as a function of the edit weight $\alpha_s$. Dark red lines indicate means across 100 entities sampled from held-out test sets and bands show standard deviations.
  • Figure 5: Mean-aggregated effects and side effects when performing activation patching along property-specific directions in activation space. Diagonal entries (top-left to bottom-right) show the effect on the targeted property in terms of mean Spearman correlation between edit weight $\alpha_{s,k}$ and expressed quantity $y_{s,k}$. For example, patching an entity representation along a "birthyear" direction results in a corresponding change in the quantity expressed by Llama 2 13B with a correlation strength of $0.84$. Off-diagonal entries show the side effects of activation patching, e.g., "birthyear" patches affect LM output when queried for an entity's death year with a correlation strength of $0.68$.
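The Spearman correlation used in Figures 4 and 5 to quantify monotonicity can be computed as the Pearson correlation of ranks. The sketch below is a plain NumPy implementation with average ranks for ties; the years are taken from the Karl Popper example in the abstract, while the edit weights paired with them are hypothetical, since the summary does not state them.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j to the end of the run of values tied with sorted_vals[i].
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical edit weights; years from the abstract (unpatched output first).
edit_weights = [0.0, 1.0, 2.0, 3.0]
expressed_years = [1902, 1929, 1957, 1968]

rho = spearman(edit_weights, expressed_years)
print(f"Spearman correlation between edit weight and expressed year: {rho:.2f}")
```

A value near 1 indicates that the expressed quantity rises monotonically with the edit weight, regardless of whether the relationship is linear; the figure reports the mean of such per-entity correlations.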

Theorems & Definitions (2)

  • Definition 1: Linear representation of numeric properties, adapted from Jiang et al. (2024)
  • Definition 2: Monotonic representation of numeric properties