Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

Mark Rofin; Jalal Naghiyev; Michael Hahn

Abstract

Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.

Paper Structure (35 sections, 4 theorems, 33 equations, 20 figures, 9 tables)

Introduction
Setup
Information Flow Decomposition
Gradient Decomposition
Ablating Pre-Caching and Circuit Sharing
The Roles of Pre-Caching and Circuit Sharing
Estimating the Effect of Gradient Components on Feature Emergence
Experiments with Small Transformers
In Toy Tasks, NTP-Useless Features Do Not Emerge Without Pre-Caching and Circuit Sharing
Pre-Caching and Circuit Sharing Hold Together OthelloGPT's World Model
Pre-Caching is Required for Coherent Text Generation, but not Needed for Syntax
Investigating Features in Large Language Models
Finding and Understanding Pre-Cached Features in an LLM
Pre-Caching and Look-ahead Are Separate Phenomena
Related Work
...and 20 more sections

Key Result

Proposition 3.1

For any layer $k$ and position $i$,

Figures (20)

Figure 1: Left: An illustration of the information-flow decomposition for $i = 2$ and $k = 1$ into direct, pre-cached, and shared components. Direct and pre-cached paths must pass through the residual stream at position $i = 2$ and layer $k = 1$; any other path is considered shared. The turquoise rectangle indicates $r_{\theta, i = 2}^{k=1}$. Right: By decomposing the loss gradients, in Section \ref{sec:attribution} we partition the improvements in feature linearity at each training step.
Figure 2: Performance of the linear probes applied to the residual stream after the first Transformer block. Left: Majority, the feature "majority-so-far". Right: Conditioned Majority, the feature "previous-token". In both cases, the feature is NTP-useless until the 10th token, and ablating pre-caching and circuit sharing substantially hurts the probing performance.
Figure 3: Left: the gap in probing performance between NTP-useful and NTP-useless board squares for OthelloGPT. Right: 95% confidence intervals for the integrated influence of each component on the representation of NTP-useful and NTP-useless features.
Figure 4: Comparison of the three distinct types of features in a small language model: POS tags, dependency tags, and the positional feature. Left: probing scores for myopic and non-myopic models. Center: direct and pre-cached influence components for each feature type.
Figure 5: Development of the direct and pre-cached influence components of the positional feature in the non-myopic model during training.
...and 15 more figures

Theorems & Definitions (9)

Proposition 3.1: Loss gradients decomposition
Definition 3.2
Definition 3.3
Remark 3.4: informal
Proposition 5.1: Approximating influence with interventions
Proposition A.1: Restated from Proposition \ref{thm:loss_decomposition}
proof
Proposition C.1: Restated from Proposition \ref{thm:sae_proposition}
proof
