Graph Convolutions Enrich the Self-Attention in Transformers!

Jeongwhan Choi; Hyowon Wi; Jayoung Kim; Yehjin Shin; Kookjin Lee; Nathaniel Trask; Noseong Park

TL;DR

This work proposes graph-filter-based self-attention (GFSA), a general yet effective form of self-attention, at the cost of slightly higher complexity than the original self-attention mechanism.

Abstract

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, and other areas. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA), a general yet effective form of self-attention whose complexity is slightly higher than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
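The oversmoothing effect described in the abstract can be illustrated with a toy sketch. It assumes a single fixed row-stochastic attention matrix applied repeatedly with no residual connections, feed-forward blocks, or learned weights, which is a strong simplification of a real Transformer. The matrix `A`, the features `X`, and the helper names are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def spread(x):
    """Largest gap between token values within a feature column, maximized over columns."""
    return max(max(col) - min(col) for col in zip(*x))


# Hypothetical attention matrix over 3 tokens; every row sums to 1,
# as the rows of a softmax attention matrix do.
A = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.4, 0.5]]

# Hypothetical token features: 3 tokens with 2 features each.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, 2.0]]

spreads = []
H = X
for layer in range(8):
    spreads.append(spread(H))
    H = matmul(A, H)  # one attention-only "layer": H <- A H
spreads.append(spread(H))

for depth, s in enumerate(spreads):
    print(f"depth {depth}: spread = {s:.5f}")
```

In this toy setting the spread between token representations shrinks at every step, so the tokens become progressively harder to tell apart, which is the behavior the abstract calls oversmoothing.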

Paper Structure (109 sections, 2 theorems, 18 equations, 8 figures, 40 tables, 1 algorithm)

Introduction
Background & Related Work
Self-Attention in Transformers
Self-Attention and Graph Convolutional Filter
Oversmoothing in GCNs and Transformers
Graph Filter-based Self-Attention Layers
Approximation of the high-order term.
GFSA: our graph filter-based self-attention.
Properties of GFSA
Theoretical characteristics of approximation error in GFSA.
How to alleviate the oversmoothing problem?
The meaning of the high-order term in GFSA in the context of Transformers.
Comparison to Transformers.
Comparison to GCNs.
Experiments
...and 94 more sections

Key Result

Theorem 3.1

Let $\bar{\bm{A}}$ be a self-attention matrix interpreted as a graph with connected components. Consider the polynomial graph filter defined by $\sum_{k=0}^K w_k \bar{\bm{A}}^k$, where $w_2, w_3, \ldots, w_{K-1} = 0$ and only $w_0$, $w_1$, and $w_K$ are non-zero. If the coefficients $w_k$ for $k=0,1
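The filter form named in the theorem, $w_0 \bm{I} + w_1 \bar{\bm{A}} + w_K \bar{\bm{A}}^K$, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it computes $\bar{\bm{A}}^K$ exactly by repeated multiplication, whereas the outline indicates the paper approximates the high-order term. The matrix, the coefficient values, and the function names are assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matpow(a, k):
    """Return a**k for a non-negative integer k by repeated multiplication."""
    result = identity(len(a))
    for _ in range(k):
        result = matmul(result, a)
    return result


def sparse_poly_filter(a, w0, w1, wK, K):
    """Return w0*I + w1*A + wK*A^K (all other polynomial coefficients are zero)."""
    n = len(a)
    eye = identity(n)
    aK = matpow(a, K)
    return [[w0 * eye[i][j] + w1 * a[i][j] + wK * aK[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical row-stochastic attention matrix over 3 tokens.
A = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.4, 0.5]]

# Setting w0 = 0, w1 = 1, wK = 0 leaves the attention matrix unchanged.
plain = sparse_poly_filter(A, 0.0, 1.0, 0.0, K=3)

# Hypothetical coefficients; in a model these would be learned parameters.
filtered = sparse_poly_filter(A, 0.2, 1.0, -0.3, K=3)
row_sums = [sum(row) for row in filtered]
```

Because every power of a row-stochastic matrix is itself row-stochastic, each row of the filtered matrix sums to $w_0 + w_1 + w_K$ (0.9 with the values above), which gives a quick sanity check on the computation.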

Figures (8)

Figure 1: Performance improvements (%) of our GFSA when integrated with different Transformer backbones in various domains. We achieve these results with only tens to hundreds of additional parameters to Transformers.
Figure 2: Filter frequency response, cosine similarity, and singular values on ImageNet-1k for DeiT-S and DeiT-S + GFSA. Details and more visualizations are in the appendices.
Figure 3: Effectiveness of our selective layer strategy on ImageNet-1k. This shows our strategy's ability to maintain accuracy benefits while mitigating runtime increases.
Figure 4: Performance ($x$-axis), runtime ($y$-axis), and GPU usage (circle sizes) of various Transformers and integrated GFSA on Long-Range benchmark
Figure 5: Filter frequency response, cosine similarity, and singular values on STS-B for BERT and BERT+GFSA
...and 3 more figures

Theorems & Definitions (4)

Theorem 3.1: Filter characteristics based on coefficient values
Theorem 4.1: Error bound for approximated high-order term in GFSA
proof
proof
