Even Sparser Graph Transformers

Hamed Shirzad, Honghao Lin, Balaji Venkatachalam, Ameya Velingker, David Woodruff, Danica Sutherland

TL;DR

We establish theoretical conditions under which a narrow network's attention scores can match those of a wide network, and show that Spexphormer achieves good performance with drastically reduced memory requirements on various graph datasets.

Abstract

Graph Transformers excel in long-range dependency modeling, but generally require quadratic memory complexity in the number of nodes in an input graph, and hence have trouble scaling to large graphs. Sparse attention variants such as Exphormer can help, but may require high-degree augmentations to the input graph for good performance, and do not attempt to sparsify an already-dense input graph. As the learned attention mechanisms tend to use few of these edges, such high-degree connections may be unnecessary. We show (empirically and with theoretical backing) that attention scores on graphs are usually quite consistent across network widths, and use this observation to propose a two-stage procedure, which we call Spexphormer: first, train a narrow network on the full augmented graph. Next, use only the active connections to train a wider network on a much sparser graph. We establish theoretical conditions when a narrow network's attention scores can match those of a wide network, and show that Spexphormer achieves good performance with drastically reduced memory requirements on various graph datasets.
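
The second stage of the procedure can be illustrated with a small sketch: given per-edge attention scores from the narrow network, keep only a few incoming edges per node. This is a hedged illustration, not the paper's implementation. The function name `sparsify_by_attention`, the per-node budget `k`, and the choice of sampling without replacement in proportion to the scores are all assumptions made for this example.

```python
import numpy as np

def sparsify_by_attention(src, dst, scores, k, seed=0):
    """Keep at most k incoming edges per target node.

    Assumption for this sketch: edges are sampled without replacement,
    with probability proportional to their attention score.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for node in np.unique(dst):
        idx = np.flatnonzero(dst == node)  # incoming edges of this node
        if idx.size <= k:
            keep.extend(idx.tolist())      # already sparse enough
            continue
        weights = scores[idx] + 1e-12      # avoid all-zero probabilities
        probs = weights / weights.sum()
        chosen = rng.choice(idx, size=k, replace=False, p=probs)
        keep.extend(chosen.tolist())
    keep = np.sort(np.asarray(keep, dtype=int))
    return src[keep], dst[keep]

# Toy graph: node 4 has four incoming edges, node 5 has two.
src = np.array([0, 1, 2, 3, 0, 1])
dst = np.array([4, 4, 4, 4, 5, 5])
scores = np.array([0.7, 0.1, 0.1, 0.1, 0.5, 0.5])
sparse_src, sparse_dst = sparsify_by_attention(src, dst, scores, k=2)
```

The wider network would then be trained with attention restricted to the edges in `(sparse_src, sparse_dst)`, so its memory cost scales with `k` times the number of nodes rather than with the size of the augmented graph.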

Paper Structure

This paper contains 52 sections, 7 theorems, 26 equations, 22 figures, 9 tables, 2 algorithms.

Key Result

Lemma E.1

Let $0 < \epsilon, \delta < \frac{1}{2}$ and let $D$ be any positive integer. If $d = \mathcal{O}(\frac{\log(1/\delta)}{\epsilon^2})$, there exists a distribution over matrices $\mathbf{M} \in \mathbb{R}^{d \times D}$ such that for any $x \in \mathbb{R}^{D}$ with $\lVert x \rVert = 1$:
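
The behavior of a Johnson-Lindenstrauss transform can be checked numerically. The sketch below assumes one standard construction, a matrix with i.i.d. Gaussian entries of variance $1/d$, which is not necessarily the distribution used in the paper. The dimensions and the helper name `gaussian_jl_matrix` are chosen only for illustration. It measures how far the squared norm of a projected unit vector is from 1, and how well a dot product between two unit vectors is preserved.

```python
import numpy as np

def gaussian_jl_matrix(d, D, rng):
    # Assumed construction: i.i.d. N(0, 1/d) entries, so that
    # E[||Mx||^2] = ||x||^2 for every fixed x.
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, D))

rng = np.random.default_rng(0)
D, d = 4096, 1024  # original and reduced dimensions (illustrative)

# Two random unit vectors in R^D.
x = rng.normal(size=D)
x /= np.linalg.norm(x)
y = rng.normal(size=D)
y /= np.linalg.norm(y)

M = gaussian_jl_matrix(d, D, rng)

# Deviation of the projected squared norm from 1.
norm_error = abs(np.linalg.norm(M @ x) ** 2 - 1.0)
# Deviation of the projected dot product from the original one.
dot_error = abs((M @ x) @ (M @ y) - x @ y)

print(f"norm error: {norm_error:.4f}, dot-product error: {dot_error:.4f}")
```

For this construction both errors shrink roughly like $1/\sqrt{d}$, which matches the $d = \mathcal{O}(\log(1/\delta)/\epsilon^2)$ dependence in the lemma.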

Figures (22)

  • Figure 1: Figure (a) shows a very simple synthetic graph where each node has a binary classification task of determining whether there exists a node of the opposite color in the same connected component. This task requires learning long-range dependencies. Figure (b) shows a natural clustering of the graph. With this clustering, no node can solve its task if the model is trained on only one cluster at a time. Figure (c) shows neighbor sampling starting from the green node, where random sampling fails to select the single important edge that bridges to the different-colored nodes. Figure (d) shows a random subset sampling strategy, where the task is solvable if and only if both endpoints of the bridge between the two colors are selected. If we increase the size of each cluster while keeping just one edge between the two colors, the probability of selecting the bridge in any batch goes to zero, and thus training will fail in this scenario. Figure (e) shows the attention scores between the nodes when trained with an attention-based network. Dashed lines have near-zero attention scores, and thicker lines indicate larger attention scores. Given these attention scores, each node can solve the task perfectly with just one directed edge. These attention edges are shown in (f). If two nodes are equally informative, selecting either of them leads to the correct result.
  • Figure 2: Steps of our method. (a) The attention mechanism for the attention score estimator network combines graph edges with an expander graph and self-loops. The expander graphs are constructed by combining a small number of Hamiltonian cycles -- here two, in red and in purple -- then confirming the spectral gap is large enough. (b) Self-attention layers in the estimator network use this sparse attention mechanism; its self-attention layers normalize $\mathbf{V}$. (c, d) Attention scores are extracted from this network for each layer, and used to sample, in (e), a sparse directed graph, which becomes the attention graph for the final network (f). This network, with a much larger feature dimension, does not normalize $\mathbf{V}$.
  • Figure 3: Energy distance between the attention scores of various networks and those of a network of width 64. "Uniform" refers to the baseline that assigns equal scores to each neighbor, while "random" refers to the baseline with uniformly distributed logits. The remaining bars refer to networks trained with the labeled width.
  • Figure 4: Memory usage comparison: Attention Score Estimator network and Spexphormer vs. Exphormer with expander degrees 6 and 30. Exphormer with degree 30 for the ogbn-arxiv dataset could not fit into the memory of a 40GB GPU device, and thus the number here is a lower bound.
  • Figure 5: The memory and runtime trade-off for the ogbn-proteins and ogbn-arxiv datasets. The plot demonstrates that memory and time can be effectively traded off against each other in our approach. The reported runtime covers the whole process: preprocessing the batches, training, and evaluation on the validation and test sets. All experiments were conducted on a V100 GPU with 32GB of memory.
  • ...and 17 more figures
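
The expander construction described in the caption of Figure 2 (a union of a few random Hamiltonian cycles, followed by a spectral gap check) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the acceptance threshold `min_gap`, the retry budget, and the use of the second-largest eigenvalue of the degree-normalized adjacency matrix as the gap measure are choices made for this example, not details confirmed by the source.

```python
import numpy as np

def hamiltonian_cycle_union(n, num_cycles, rng):
    """Adjacency matrix (with multiplicities) of a union of random Hamiltonian cycles."""
    A = np.zeros((n, n))
    for _ in range(num_cycles):
        perm = rng.permutation(n)   # random visiting order of all n nodes
        nxt = np.roll(perm, -1)     # successor of each node on the cycle
        np.add.at(A, (perm, nxt), 1.0)
        np.add.at(A, (nxt, perm), 1.0)
    return A

def spectral_gap(A):
    """1 minus the second-largest eigenvalue of the degree-normalized adjacency."""
    degree = A.sum(axis=1)[0]       # the graph is regular by construction
    eigenvalues = np.linalg.eigvalsh(A / degree)  # ascending order
    return 1.0 - eigenvalues[-2]

def sample_expander(n, num_cycles=2, min_gap=0.05, max_tries=20, seed=0):
    """Resample until the gap exceeds min_gap (a hypothetical threshold)."""
    rng = np.random.default_rng(seed)
    best_A, best_gap = None, -np.inf
    for _ in range(max_tries):
        A = hamiltonian_cycle_union(n, num_cycles, rng)
        gap = spectral_gap(A)
        if gap > best_gap:
            best_A, best_gap = A, gap
        if gap >= min_gap:
            break
    return best_A, best_gap

A, gap = sample_expander(n=50, num_cycles=2)
print(f"degree: {int(A.sum(axis=1)[0])}, spectral gap: {gap:.3f}")
```

Each cycle contributes degree 2 to every node, so two cycles give a 4-regular multigraph. Because every Hamiltonian cycle is connected, the gap is always strictly positive, and the check only filters out unlucky draws with poor expansion.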

Theorems & Definitions (10)

  • Lemma E.1: Johnson-Lindenstrauss Transform Lemma (JLT)
  • Corollary E.2
  • Corollary E.3: JLT-dot product
  • Theorem E.4
  • proof
  • Theorem E.5
  • Lemma E.6: Matrix Bernstein inequality
  • proof
  • Proposition E.7
  • proof