Spectral Journey: How Transformers Predict the Shortest Path

Andrew Cohen, Andrey Gromov, Kaiyu Yang, Yuandong Tian

TL;DR

This work probes whether decoder-only transformers can plan or reason by training 2-layer models from scratch to predict shortest paths on simple graphs. The models develop edge embeddings that correlate with the spectral structure of the line graph $L(G)$, and their attention heads focus on the edges of the current and target nodes. These observations motivate a novel spectral path-finding algorithm called Spectral Line Navigation (SLN). SLN, built directly from the learned representations, achieves 99.32% accuracy on the test set (the trained model reaches 99.42%), which suggests that spectral methods can underlie seemingly sequential neural computations. The findings contribute to mechanistic interpretability of language models and point to a spectrum-based approach to graph reasoning, with implications for understanding planning-like behavior in neural networks.

Abstract

Decoder-only transformers have led to a step change in the capability of large language models. However, opinions are mixed as to whether they are really planning or reasoning. One path to progress on this question is to study the model's behavior in a setting with carefully controlled data, then interpret the learned representations and reverse-engineer the computation performed internally. We study decoder-only transformer language models trained from scratch to predict shortest paths on simple, connected, and undirected graphs. In this setting, the representations and the dynamics learned by the model are interpretable. We present three major results: (1) Two-layer decoder-only language models can learn to predict shortest paths on simple, connected graphs containing up to 10 nodes. (2) Models learn a graph embedding that is correlated with the spectral decomposition of the line graph. (3) Following these insights, we discover a novel approximate path-finding algorithm, Spectral Line Navigator (SLN), that finds shortest paths by greedily selecting nodes in the space of the spectral embedding of the line graph.
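
To make result (3) concrete, the following is a minimal sketch of a greedy walk in the spectral embedding of the line graph. It is one plausible reading of the abstract, not the authors' implementation. The function names `line_graph_embedding` and `sln_path`, the default of `k=4` eigenvectors, the Euclidean distance, and the rule of never revisiting a node are all assumptions made for illustration.

```python
import numpy as np

def line_graph_embedding(edges, k=4):
    """Embed each edge of G using eigenvectors of the normalized Laplacian of L(G).

    Assumption: rows are the eigenvector coefficients belonging to the k
    smallest non-zero eigenvalues (fewer if the line graph is small).
    """
    m = len(edges)
    A = np.zeros((m, m))
    # Two edges of G are adjacent in L(G) when they share an endpoint.
    for i in range(m):
        for j in range(i + 1, m):
            if set(edges[i]) & set(edges[j]):
                A[i, j] = A[j, i] = 1.0
    deg = A.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    laplacian = np.eye(m) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    keep = np.where(vals > 1e-9)[0][:k]     # skip the zero eigenvalue
    return vecs[:, keep]

def sln_path(edges, source, target, k=4):
    """Greedy walk: step along the edge whose embedding is closest to a target edge.

    Assumes a connected, simple, undirected graph that contains both nodes.
    Returns None if the walk reaches a dead end.
    """
    emb = line_graph_embedding(edges, k)
    target_rows = [i for i, e in enumerate(edges) if target in e]
    path, visited = [source], {source}
    while path[-1] != target:
        cur, best = path[-1], None
        for i, (u, v) in enumerate(edges):
            if cur not in (u, v):
                continue
            nxt = v if u == cur else u
            if nxt in visited:  # assumption: never step back onto the path
                continue
            dist = min(np.linalg.norm(emb[i] - emb[t]) for t in target_rows)
            if best is None or dist < best[0]:
                best = (dist, nxt)
        if best is None:
            return None
        path.append(best[1])
        visited.add(best[1])
    return path

print(sln_path([(0, 1), (1, 2), (2, 3)], source=0, target=3))
```

Because the walk is greedy, it is approximate by construction: nothing in this sketch guarantees that the returned path is shortest on every graph, which is consistent with the abstract describing SLN as an approximate algorithm.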

Paper Structure

This paper contains 24 sections, 4 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Overview. (a) We train 2-layer transformers to predict the nodes on the shortest path between a source and target node for a given graph represented sequentially as a list of edges and nodes, in the format "<bos> 0 1 <e> 1 2 <e> ... <q> [source] [target] <p> [source]" (i.e. "there are edges connecting node 0 and node 1, node 1 and node 2, ..., please find the shortest path between source and target"). (b) We find a strong correlation between model embeddings in layer 1 and the spectral decomposition of the graph, and (c) attention heads in layer 2 (attention activations denoted by edge thickness) that attend to the edge tokens of the current and target nodes. Using this, we derive, implement, and evaluate a novel (approximate) path-finding algorithm, Spectral Line Navigation (SLN). (d) During training, the model first learns to predict paths with 2 nodes (connected by a single edge) and then learns an algorithm for paths with $>$2 nodes. Accuracy on paths of length 3, 4, and beyond improves simultaneously. (e) After training, the model achieves $99.42\%$ accuracy on the test set and SLN achieves $99.32\%$ accuracy.
  • Figure 2: (Top) Probability of generating a shortest path by path length for 2-layer models with 1, 2, 4, and 8 heads on the test set. Increasing the number of heads improves performance, although all models are able to perform the task with high accuracy. The worst category is the 1-head model on paths of length $6$, where the 1.5 interquartile range is above $0.95$. (Bottom) The occurrence of samples by probability of correctness and $\bar{\ell} - \ell^*$ (defined in Equation \ref{complexity}). Yellow is larger. Samples in which there are many paths of similar length between source and target ($\bar{\ell} - \ell^* \rightarrow 1$) contribute to the failures. As the number of heads increases, the model becomes more robust.
  • Figure 3: Probability distribution over $j \in \{2,3,4,5,6,7\}$ shortest paths for 2-layer models with 1, 2, 4, and 8 heads. Samples are grouped by the number of shortest paths between source and target in the range. We compute the probability of each of the $j$ paths and sort them in descending order. Each point is the mean and standard deviation over the test set.
  • Figure 4: Attention activations of $h_{current}$ and $h_{target}$ of the 4-head model in the final layer, visualized as the thickness of an edge for an example graph and a shortest path query from source node ($2$) to target node ($8$). Each column corresponds to 1 iteration of generation, and the current node and the path so far are highlighted in blue. (Top) $h_{current}$ attends to the edge tokens corresponding to edges containing the current node in the sequence. Additionally, the relative attention activation is reduced for the edge connecting the current node to the previous node. (Bottom) $h_{target}$ attends to the edge tokens corresponding to edges containing the target node.
  • Figure 5: Cosine similarity of distance matrices between the top $20$ principal components of edge token embeddings and the eigenvector coefficients of the $20$ smallest non-zero eigenvalues of the normalized Laplacian of $L(G)$ over $10000$ random samples. For each sample, we apply remap $100$ times. For 1, 2, and 4 heads, the maximums (red square) are $0.928$, $0.907$, and $0.909$, respectively, with $4$ eigenvector and $4$ PCA coefficients. For 8 heads, the maximum is $0.826$ with $7$ eigenvector and $20$ PCA coefficients.
  • ...and 4 more figures
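
Two of the captions above describe procedures that a short sketch may help pin down: the sequential prompt format of Figure 1(a) and the "cosine similarity of distance matrices" of Figure 5. The helper names `serialize_query`, `pairwise_distances`, and `distance_matrix_cosine` are hypothetical. The sketch assumes that every edge, including the last, is followed by `<e>` (the caption elides this with "..."), that tokens are separated by spaces, and that the distance matrices hold pairwise Euclidean distances between edge representations and are compared after flattening.

```python
import numpy as np

def serialize_query(edges, source, target):
    """Render a graph and a path query as a token string in the Figure 1(a) format.

    Assumption: each edge contributes "u v <e>", including the last one.
    """
    tokens = ["<bos>"]
    for u, v in edges:
        tokens += [str(u), str(v), "<e>"]
    # The prompt ends with the source node, the first node of the path.
    tokens += ["<q>", str(source), str(target), "<p>", str(source)]
    return " ".join(tokens)

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_matrix_cosine(X, Y):
    """Cosine similarity between the flattened distance matrices of X and Y.

    X and Y hold one row per edge, in the same edge order, but may have
    different numbers of columns (e.g. PCA vs. eigenvector coefficients).
    """
    dx = pairwise_distances(np.asarray(X, dtype=float)).ravel()
    dy = pairwise_distances(np.asarray(Y, dtype=float)).ravel()
    return float(dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy)))

print(serialize_query([(0, 1), (1, 2)], source=0, target=2))
```

Comparing distance matrices rather than raw coordinates makes the score insensitive to rotations and uniform rescaling of either embedding, which is useful here because principal components and Laplacian eigenvectors are each defined only up to such transformations.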