Taming Gradient Oversmoothing and Expansion in Graph Neural Networks

MoonJeong Park, Dongwoo Kim

TL;DR

We show that constraining the Lipschitz bound of each layer neutralizes gradient expansion, and we provide a simple yet effective normalization method that prevents it.

Abstract

Oversmoothing has been claimed as a primary bottleneck for multi-layered graph neural networks (GNNs). Multiple analyses have examined how and why oversmoothing occurs. However, no prior work has addressed how optimization is performed under the oversmoothing regime. In this work, we show the presence of $\textit{gradient oversmoothing}$, which prevents optimization during training. We further show that GNNs with residual connections, a well-known solution to help gradient flow in deep architectures, introduce $\textit{gradient expansion}$, a phenomenon in which gradients explode in diverse directions. Therefore, adding residual connections alone cannot be a solution for making a GNN deep. Our analysis reveals that constraining the Lipschitz bound of each layer can neutralize the gradient expansion. To this end, we provide a simple yet effective normalization method to prevent the gradient expansion. An empirical study shows that residual GNNs with hundreds of layers can be efficiently trained with the proposed normalization without compromising performance. Additional studies show that the empirical observations corroborate our theoretical analysis.
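
The abstract does not spell out the normalization itself, so the following is only a minimal sketch of one way to constrain a layer's Lipschitz bound, not the authors' method. It assumes a residual linear layer of the form H + A_hat H W, whose Lipschitz constant is at most 1 + ||A_hat||_2 ||W||_2, and caps ||W||_2 by rescaling. The names `clip_spectral_norm`, `res_linear_gcn`, and `bound` are hypothetical.

```python
import numpy as np

def clip_spectral_norm(W, bound=1.0):
    """Rescale W so its spectral norm (largest singular value) is at most `bound`.

    Assumption: spectral-norm rescaling stands in for the paper's normalization.
    """
    sigma = np.linalg.norm(W, ord=2)  # largest singular value of W
    return W / max(1.0, sigma / bound)  # leave W unchanged if already within bound

def res_linear_gcn(A_hat, X, weights, bound=None):
    """Residual linear GCN (no activation): H <- H + A_hat @ H @ W per layer.

    Assumption: this residual form is illustrative, not taken from the paper.
    """
    H = X
    for W in weights:
        if bound is not None:
            W = clip_spectral_norm(W, bound)
        H = H + A_hat @ H @ W
    return H

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W_clipped = clip_spectral_norm(W, bound=0.5)
```

In a training loop, a rescaling like this would typically be applied to each layer's weights after every optimizer step, so that the per-layer bound holds throughout training.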

Paper Structure (27 sections, 4 theorems, 25 equations, 16 figures, 1 table)

Key Result

Lemma 1

[blakely2021time] Let $\mathsf{LGN}$ be an $L$-layered linear GCN without an activation function, and let $\mathcal{L}_\mathsf{LGN}({\mathbf{W}}, {\mathbf{X}}, {\mathbf{y}})$ be a loss function of the linear GCN with a set of parameters ${\mathbf{W}} = ({\mathbf{W}}^{(0)}, \cdots, {\mathbf{W}}^{(L-1)})$ and where

Figures (16)

  • Figure 1: Gradient similarity measure $\mu\left({\frac{\partial{\mathcal{L}_\mathsf{GNN}({\mathbf{W}})}}{\partial{{{\mathbf{X}}}^{({\ell})}}}}\right)$ over different layers and activation functions of $128$-layer GCN and GAT in three datasets: Cora, CiteSeer, and Chameleon.
  • Figure 2: Gradient similarity measure $\mu\left({\frac{\partial{\mathcal{L}_\mathsf{resGNN}({\mathbf{W}})}}{\partial{{{\mathbf{X}}}^{({\ell})}}}}\right)$ over different layers of $64$-layer GCN and GAT with residual connections on three datasets: Cora, CiteSeer, and Chameleon. Similarity measures with a "NaN" value are omitted from the plot.
  • Figure 3: Scatter plots of (a) representation similarity vs. gradient similarity, (b) test accuracy vs. representation similarity, and (c) test accuracy vs. gradient similarity on the Cora dataset.
  • Figure 4: Gradient similarity measure $\mu\left({\frac{\partial{\mathcal{L}_\mathsf{GNN}({\mathbf{W}})}}{\partial{{{\mathbf{X}}}^{({\ell})}}}}\right)$ over training with 4-, 16-, and 64-layer GCN and GAT. We report the similarity measures on two datasets: Cora and Chameleon. The dashed and solid lines represent the similarity measured at the start and end of training, respectively. The shaded area spans the minimum and maximum similarities over training.
  • Figure 5: Gradient similarity measure $\mu\left({\frac{\partial{\mathcal{L}_\mathsf{resGNN}({\mathbf{W}})}}{\partial{{{\mathbf{X}}}^{({\ell})}}}}\right)$ over training with 4-, 16-, and 64-layer resGCN and resGAT. The dashed and solid lines represent the similarity measured at the start and end of training, respectively. The shaded area spans the minimum and maximum similarities over training. In deep networks, residual connections introduce gradient expansion.
  • ...and 11 more figures
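
The captions above use a similarity measure $\mu$ without defining it. As a rough illustration only, the sketch below assumes $\mu$ is the node-similarity measure common in the oversmoothing literature: the Frobenius distance between a matrix and the matrix whose rows all equal the mean row. The function name `similarity_measure` is hypothetical, and the paper's exact definition may differ.

```python
import numpy as np

def similarity_measure(G):
    """Frobenius distance of G (num_nodes x dim) from its row-wise mean.

    Assumption: mu(G) = ||G - 1 * mean_row||_F. A value of 0 means every
    node has the same row (fully smoothed); larger values mean the rows differ more.
    """
    mean_row = G.mean(axis=0, keepdims=True)  # shape (1, dim)
    return float(np.linalg.norm(G - mean_row))  # Frobenius norm for 2-D input

# Identical per-node gradients: the measure collapses to zero.
smoothed = np.tile(np.array([[0.3, -0.1]]), (5, 1))
# Two nodes with opposite gradients: the measure is strictly positive.
spread = np.array([[1.0, 0.0], [-1.0, 0.0]])
```

Under this assumed definition, a measure that shrinks toward zero across layers would correspond to gradient oversmoothing, while a measure that grows without bound would correspond to gradient expansion.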

Theorems & Definitions (10)

  • Lemma 1
  • Theorem 1: Gradient oversmoothing in LGN
  • Lemma 2
  • Theorem 2: Gradient expansion in resLGN
  • proof
  • proof
  • Definition 1: Ergodicity
  • Definition 2: Joint Spectral Radius
  • proof
  • proof
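
The paper's own statement of Definition 2 is not reproduced on this page. For orientation, the standard textbook definition of the joint spectral radius of a bounded set of square matrices $\mathcal{M}$ is given below; the symbols $\rho$, $\mathcal{M}$, and $A_i$ are chosen here for illustration and may not match the paper's notation.

```latex
\rho(\mathcal{M}) \;=\; \lim_{k \to \infty} \, \sup \left\{ \left\| A_1 A_2 \cdots A_k \right\|^{1/k} \;:\; A_i \in \mathcal{M} \right\}
```

The limit does not depend on the choice of submultiplicative matrix norm. When $\mathcal{M}$ contains a single matrix $A$, it reduces to the ordinary spectral radius of $A$ by Gelfand's formula.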