Graph Transformers Dream of Electric Flow

Xiang Cheng, Lawrence Carin, Suvrit Sra

TL;DR

This work analyzes how a linear Transformer processing graph data via the incidence matrix can implement fundamental Laplacian-based algorithms. It provides explicit weight configurations to realize electric flow (and hence $\sqrt{{\mathcal{L}}^{\dagger}}$ and related operators), the heat kernel, a multiplicative polynomial expansion, and subspace iteration for computing eigenvectors, with rigorous layer-dependent error bounds. The authors also introduce a parameter-efficient variant and demonstrate that a Transformer can learn useful positional encodings for molecular regression tasks, sometimes outperforming Laplacian-based encodings. Empirical results on synthetic graphs and real-world molecular datasets corroborate the theory, showing that a few layers suffice to approximate these linear-algebraic targets and that learned PEs can improve downstream performance.

Abstract

We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The Transformer has access to information on the input graph only via the graph's incidence matrix. We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers' errors by the errors of the underlying algorithms. Our theoretical findings are corroborated by experiments on synthetic data. Additionally, on a real-world molecular regression task, we observe that the linear Transformer is capable of learning a more effective positional encoding than the default one based on Laplacian eigenvectors. Our work is an initial step towards elucidating the inner workings of the Transformer for graph data. Code is available at https://github.com/chengxiang/LinearGraphTransformer
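To make the electric-flow problem concrete, the following is a minimal, dependency-free Python sketch of the classical computation that Lemma 1 (listed below) associates with the Transformer: gradient descent on the Laplacian system, using only the incidence matrix. It is not the paper's weight construction. The helper names (`incidence_matrix`, `matvec`, `electric_flow`), the unit edge resistances, the step size, and the iteration count are all assumptions made for illustration.

```python
def incidence_matrix(num_nodes, edges):
    """Build the (num_edges x num_nodes) incidence matrix B.

    Row e has +1 at u and -1 at v for edge (u, v). Unit resistances are
    assumed, so the graph Laplacian is B^T B.
    """
    B = [[0.0] * num_nodes for _ in edges]
    for e, (u, v) in enumerate(edges):
        B[e][u] = 1.0
        B[e][v] = -1.0
    return B


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def electric_flow(num_nodes, edges, demand, step_size=0.25, num_steps=60):
    """Approximate node potentials and edge flows by gradient descent.

    Minimizes 0.5 * phi^T L phi - demand^T phi with L = B^T B. The demand
    must sum to zero. The step size must be below 2 / lambda_max(L); the
    default is hand-picked for small graphs, not a general-purpose choice.
    """
    B = incidence_matrix(num_nodes, edges)
    Bt = [list(col) for col in zip(*B)]
    phi = [0.0] * num_nodes
    for _ in range(num_steps):
        # Gradient is L @ phi - demand, computed as B^T (B phi) - demand.
        L_phi = matvec(Bt, matvec(B, phi))
        phi = [p - step_size * (lp - d) for p, lp, d in zip(phi, L_phi, demand)]
    flow = matvec(B, phi)  # current on each edge, oriented from u to v
    return phi, flow


# Triangle graph: inject one unit of current at node 0, extract it at node 2.
edges = [(0, 1), (1, 2), (0, 2)]
potentials, flow = electric_flow(3, edges, demand=[1.0, 0.0, -1.0])
```

On this triangle the direct edge (0, 2) has half the resistance of the two-edge path through node 1, so the flow converges to 2/3 on the direct edge and 1/3 on each of the other two edges. Starting from zero potentials keeps every iterate orthogonal to the all-ones vector, which is why the limit is the pseudoinverse solution.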

Paper Structure

This paper contains 34 sections, 10 theorems, 17 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Consider the setup in Section \ref{ss:poly_setup}. Assume that $\langle \psi_i, \vec{1} \rangle = 0$ for each $i = 1, \dots, k$. For any $\delta > 0$ and for any $L$-layer Transformer, there exists a choice of weights $W^V, W^Q, W^K, W^R$ such that each layer of the Transformer \ref{e:z_dynamics} implements a step of

Figures (3)

  • Figure 1: $\text{loss}_U$ against number of layers at convergence for $U \in \left\{{\mathcal{L}}^{\dagger}, \sqrt{{\mathcal{L}}^{\dagger}}, e^{-0.5 {\mathcal{L}}}\right\}$.
  • Figure 2: $\log(\text{loss}_*)$ vs. number of layers. Top row: various losses for the Transformer trained on $\text{loss}_{1-5}$. Bottom row: various losses for the Transformer trained on $\text{loss}_{1-10}$.
  • Figure 3: Plot of loss against number of layers for the 4 problems. Figures \ref{f:loss_layer_efficient}(a), \ref{f:loss_layer_efficient}(b), \ref{f:loss_layer_efficient}(c), and \ref{f:loss_layer_efficient}(d) correspond to Figures \ref{f:electric_loss_against_layer}(a), \ref{f:electric_loss_against_layer}(b), \ref{f:electric_loss_against_layer}(c), and \ref{f:ev_loss_against_layer}(d), respectively. The experimental setup of each corresponding pair of plots is identical except for the architecture used: all plots in Figure \ref{f:loss_layer_efficient} are made using the efficient implementation described in Section \ref{s:efficient}.

Theorems & Definitions (10)

  • Lemma 1: Transformer solves Electric Flow by implementing Gradient Descent
  • Lemma 2: Principal Square Root $\sqrt{{\mathcal{L}}^{\dagger}}$
  • Lemma 3: Heat Kernel $e^{-s{\mathcal{L}}}$
  • Lemma 4
  • Lemma 5: Fast Heat Kernel
  • Lemma 6: Subspace Iteration for Finding Top $k$ Eigenvectors
  • Corollary 7: Subspace Iteration for Finding Bottom $k$ Eigenvectors
  • Lemma 8: Single Index Orthogonalization
  • Lemma 9: Constructions under \ref{e:z_dynamics_efficient}
  • Lemma 10: Invariance and Equivariance
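
Lemma 6 names subspace iteration as the algorithm behind the eigenvector results. As a rough illustration of that textbook method only, and not of the Transformer construction or the paper's orthogonalization step, the sketch below repeatedly multiplies a set of vectors by the Laplacian and re-orthonormalizes them with classical Gram-Schmidt. The function names, the starting vectors, the iteration count, and the 4-node path graph are assumptions chosen for the example.

```python
def laplacian(num_nodes, edges):
    """Unweighted graph Laplacian (degree matrix minus adjacency matrix)."""
    L = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    return L


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(M, x):
    return [dot(row, x) for row in M]


def orthonormalize(vectors):
    """Classical Gram-Schmidt; assumes the inputs are linearly independent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis


def subspace_iteration(M, start_vectors, num_steps=200):
    """Approximate the top-k eigenvectors of a symmetric PSD matrix M.

    Each step multiplies every vector by M and then re-orthonormalizes,
    so the k vectors converge to the k leading eigenvectors in order.
    """
    V = orthonormalize(start_vectors)
    for _ in range(num_steps):
        V = orthonormalize([matvec(M, v) for v in V])
    return V


# Path graph on 4 nodes; look for the top k = 2 Laplacian eigenvectors.
L = laplacian(4, [(0, 1), (1, 2), (2, 3)])
start = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
top_vectors = subspace_iteration(L, start)

# Rayleigh quotients v^T L v estimate the corresponding eigenvalues.
rayleigh = [dot(v, matvec(L, v)) for v in top_vectors]
```

For this path graph the Laplacian eigenvalues are $2 - 2\cos(k\pi/4)$ for $k = 0, \dots, 3$, so the two Rayleigh quotients should approach $2 + \sqrt{2}$ and $2$. The bottom-$k$ variant in Corollary 7 is not covered by this sketch.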