A Theory for Compressibility of Graph Transformers for Transductive Learning

Hamed Shirzad, Honghao Lin, Ameya Velingker, Balaji Venkatachalam, David Woodruff, Danica Sutherland

TL;DR

Theoretical bounds are established on how, and under what conditions, the hidden dimension of Graph Transformers can be compressed. The bounds apply to both sparse and dense variants.

Abstract

Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks, as the independent and identically distributed (i.i.d.) assumption does not hold among samples. Instead, all train/test/validation samples are present during training, making them more akin to a semi-supervised task. These differences make the analysis of the models substantially different from other models. Recently, Graph Transformers have significantly improved results on these datasets by overcoming long-range dependency problems. However, the quadratic complexity of full Transformers has driven the community to explore more efficient variants, such as those with sparser attention patterns. While the attention matrix has been extensively discussed, the hidden dimension or width of the network has received less attention. In this work, we establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed. Our results apply to both sparse and dense variants of Graph Transformers.

Paper Structure

This paper contains 33 sections, 17 theorems, 54 equations, 1 figure, 4 tables.

Key Result

Lemma 3.1

Assume $0 < \epsilon, \delta < \frac{1}{2}$ and let $D$ be any positive integer. If $d = \mathcal{O}(\frac{\log(1/\delta)}{\epsilon^2})$, there exists a distribution over matrices $\mathbf{M} \in \mathbb{R}^{d \times D}$ such that for any $x \in \mathbb{R}^{D}$ with $\lVert x \rVert = 1$,

Figures (1)

  • Figure 1: Comparison of the results from a relatively large network with hidden dimension 64 and a small network with hidden dimension 4.

Theorems & Definitions (33)

  • Lemma 3.1: Johnson-Lindenstrauss Transform Lemma, JLT
  • Corollary 3.2: JLT-dot product
  • Theorem 3.3
  • Proposition 4.1
  • Theorem 4.2
  • Proposition 4.3
  • Theorem 4.4
  • proof
  • proof
  • Lemma C.1
  • ...and 23 more
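
Lemma 3.1 and Corollary 3.2 above are labeled as the Johnson-Lindenstrauss transform and its dot-product variant. The sketch below illustrates that kind of guarantee numerically and is not taken from the paper. It assumes one standard construction, a matrix with i.i.d. Gaussian entries of variance $1/d$. The constant `c` in the dimension formula, the helper names `jl_dimension` and `sample_jl_matrix`, and the values of `eps`, `delta`, and `D` are all hypothetical choices made for illustration.

```python
import math
import numpy as np


def jl_dimension(eps, delta, c=8.0):
    """Target dimension d = ceil(c * log(1/delta) / eps^2).

    The constant c is an assumption for this sketch, not a value from the paper.
    """
    return math.ceil(c * math.log(1.0 / delta) / eps ** 2)


def sample_jl_matrix(d, D, rng):
    """Draw M in R^{d x D} with i.i.d. N(0, 1/d) entries (a common JL construction)."""
    return rng.normal(loc=0.0, scale=1.0 / math.sqrt(d), size=(d, D))


rng = np.random.default_rng(0)
eps, delta, D = 0.2, 0.01, 2000

d = jl_dimension(eps, delta)
M = sample_jl_matrix(d, D, rng)

# Two random unit vectors in the original D-dimensional space.
x = rng.normal(size=D)
x /= np.linalg.norm(x)
y = rng.normal(size=D)
y /= np.linalg.norm(y)

# Distortion of the squared norm and of the dot product after projection.
norm_error = abs(np.linalg.norm(M @ x) ** 2 - 1.0)
dot_error = abs((M @ x) @ (M @ y) - x @ y)

print(f"d = {d} (down from D = {D})")
print(f"squared-norm error: {norm_error:.4f}")
print(f"dot-product error:  {dot_error:.4f}")
```

Note that the projected dimension depends only on `eps` and `delta`, not on `D`, which is what makes this style of argument a natural tool for reasoning about compressing a hidden dimension.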