An introduction to graphical tensor notation for mechanistic interpretability

Jordan K. Taylor

TL;DR

The paper argues that graphical tensor notation provides a compact, intuitive language for representing linear tensor operations, with the goal of aiding mechanistic interpretability of neural networks and transformers. It develops the notation from basic tensors to contractions, delta and isometric tensors, and common decompositions such as the singular value decomposition $M=U D V^\dagger$ (with $M$ also expressed via $\lambda_i$) and higher-order CP/Tucker generalizations, before applying it to tensor networks and neural models. The second half adapts the framework to interpretability work in language models, loosely following transformer-circuit ideas and culminating in a handcrafted toy induction-head circuit that illustrates how composition and fixed attention patterns can encode predictive structure. The approach highlights dualities and path-wise interpretations that can simplify analysis, compression, and communication of mechanistic insights, potentially aiding debugging and transparency in large models. The techniques provide a structured lens for tracing information flow and identifying dominant terms in multi-layer attention networks, with practical value for understanding how learned behavior emerges in transformers and beyond.

Abstract

Graphical tensor notation is a simple way of denoting linear operations on tensors, originating from physics. Modern deep learning consists almost entirely of operations on or between tensors, so easily understanding tensor operations is quite important for understanding these systems. This is especially true when attempting to reverse-engineer the algorithms learned by a neural network in order to understand its behavior: a field known as mechanistic interpretability. It's often easy to get confused about which operations are happening between tensors and lose sight of the overall structure, but graphical tensor notation makes it easier to parse things at a glance and see interesting equivalences. The first half of this document introduces the notation and applies it to some decompositions (SVD, CP, Tucker, and tensor network decompositions), while the second half applies it to some existing foundational approaches for mechanistically understanding language models, loosely following "A Mathematical Framework for Transformer Circuits", then constructing an example "induction head" circuit in graphical tensor notation.

Paper Structure (14 sections, 55 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Some examples of graphical tensor notation from the quimb documentation (https://quimb.readthedocs.io/en/latest/index.html) [gray2018quimb]
  • Figure 2: A matrix can be compressed by performing a singular value decomposition and discarding the smallest singular values. Here I treat an image as a matrix, and perform various levels of truncation, with the discarded singular values shown in the red shaded regions of the plot. (a) shows just one singular value kept: the matrix is approximated as a single outer-product of two vectors, scaled by the first singular value. (b) shows 7 singular values, (c) 30, and (d) 100 kept out of the 200 singular values in the full decomposition.
  • Figure 3: Image adapted from https://tensornetwork.org/ [stoudenmire2022tensornetworkorg]
  • Figure 4: A tensor network diagram of GPT-2. Computationally, the network is contracted from top to bottom (input at the top, output at the bottom). Only one of the 12 layers of attention and MLP blocks is shown; the other layers repeat the same attention and MLP blocks with different learned weight parameters $W_Q$, $W_K$, $W_V$, $W_O$, $W_\uparrow$, $W_\downarrow$.
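
The compression described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's own code: the helper name `truncated_svd` is hypothetical, and a random 200×200 matrix stands in for the image, which is not available here. The truncation levels (1, 7, 30, 100, and the full 200) are the ones listed in the caption.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Rank-k approximation of `matrix`, keeping the k largest singular values.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Reduced SVD: matrix = u @ diag(s) @ vh, with s sorted in descending order.
    u, s, vh = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then contract with the first k right singular vectors.
    return (u[:, :k] * s[:k]) @ vh[:k, :]


# Assumption: a random matrix standing in for the 200x200 image of Figure 2.
rng = np.random.default_rng(0)
image = rng.standard_normal((200, 200))
singular_values = np.linalg.svd(image, compute_uv=False)

# Relative Frobenius-norm error at each truncation level from the caption.
errors = {}
for k in (1, 7, 30, 100, 200):
    approx = truncated_svd(image, k)
    errors[k] = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(f"k = {k:3d}: relative error = {errors[k]:.3e}")
```

With `k = 1` the approximation is a single outer product of two vectors scaled by the largest singular value, matching panel (a) of the figure. The Frobenius error of each truncation equals the root-sum-square of the discarded singular values, so a natural image, whose singular values typically decay quickly, should compress far better than the random stand-in used here.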