Graph Convolutions Enrich the Self-Attention in Transformers!

Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park

TL;DR

This work proposes graph-filter-based self-attention (GFSA), a more general yet effective form of self-attention, at the cost of slightly higher complexity than the original self-attention mechanism.

Abstract

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA), which learns a more general yet effective filter, at the cost of slightly higher complexity than the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.

Paper Structure (109 sections, 2 theorems, 18 equations, 8 figures, 40 tables, 1 algorithm)


Key Result

Theorem 3.1

Let $\bar{\bm{A}}$ be a self-attention matrix interpreted as a graph with connected components. Consider the polynomial graph filter defined by $\sum_{k=0}^K w_k \bar{\bm{A}}^k$, where $w_2, w_3, \ldots, w_{K-1} = 0$ and only $w_0$, $w_1$, and $w_K$ are non-zero. If the coefficients $w_k$ for $k=0,1
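The filter form named in the theorem statement, $w_0 \bm{I} + w_1 \bar{\bm{A}} + w_K \bar{\bm{A}}^K$, can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the function names, the random inputs, and the coefficient values are hypothetical and not taken from the paper. The sketch also computes $\bar{\bm{A}}^K$ exactly, so it does not reproduce the approximated high-order term that Theorem 4.1 refers to.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, giving a row-stochastic attention matrix."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def polynomial_filter(A, w0, w1, wK, K):
    """Hypothetical helper: w0*I + w1*A + wK*A^K with an exact matrix power."""
    n = A.shape[0]
    return w0 * np.eye(n) + w1 * A + wK * np.linalg.matrix_power(A, K)

rng = np.random.default_rng(0)
n_tokens, d_model, order = 6, 4, 3            # illustrative sizes only
Q = rng.normal(size=(n_tokens, d_model))      # queries (random stand-ins)
Kmat = rng.normal(size=(n_tokens, d_model))   # keys (random stand-ins)
X = rng.normal(size=(n_tokens, d_model))      # values / token features

# Plain self-attention matrix, read as a weighted adjacency matrix
A = softmax_rows(Q @ Kmat.T / np.sqrt(d_model))

# Illustrative coefficients chosen arbitrarily, not the paper's learned values
H = polynomial_filter(A, w0=0.5, w1=1.0, wK=-0.5, K=order)
out = H @ X                                   # filtered token representations

# With w0 = 0, w1 = 1, wK = 0 the filter reduces to plain self-attention
H_plain = polynomial_filter(A, w0=0.0, w1=1.0, wK=0.0, K=order)
```

Because $\bar{\bm{A}}$ is row-stochastic, each row of the filter matrix sums to $w_0 + w_1 + w_K$, which gives a quick sanity check on any chosen coefficients.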

Figures (8)

  • Figure 1: Performance improvements (%) of our GFSA when integrated with different Transformer backbones in various domains. We achieve these results with only tens to hundreds of additional parameters to Transformers.
  • Figure 2: Filter frequency response, cosine similarity, and singular values on ImageNet-1k for DeiT-S and DeiT-S + GFSA. Details and more visualizations are in the appendices.
  • Figure 3: Effectiveness of our selective layer strategy on ImageNet-1k. This shows our strategy's ability to maintain accuracy benefits while mitigating runtime increases.
  • Figure 4: Performance ($x$-axis), runtime ($y$-axis), and GPU usage (circle sizes) of various Transformers and integrated GFSA on Long-Range benchmark
  • Figure 5: Filter frequency response, cosine similarity, and singular values on STS-B for BERT and BERT+GFSA
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 3.1: Filter characteristics based on coefficient values
  • Theorem 4.1: Error bound for approximated high-order term in GFSA
  • proof
  • proof