Beyond Over-smoothing: Uncovering the Trainability Challenges in Deep Graph Neural Networks

Jie Peng, Runlin Lei, Zhewei Wei

TL;DR

It is theoretically proven that the difficulty of training deep MLPs is the main challenge in deep GNNs, and that various existing methods that supposedly tackle Over-smoothing actually improve the trainability of MLPs, which is the main reason for their performance gains.

Abstract

The drastic performance degradation of Graph Neural Networks (GNNs) as the depth of the graph propagation layers exceeds 8-10 is widely attributed to the phenomenon of Over-smoothing. Although recent research suggests that Over-smoothing may not be the dominant reason for this degradation, it has not provided a rigorous theoretical analysis, which warrants further investigation. In this paper, we systematically analyze the real dominant problem in deep GNNs and, through empirical experiments and theoretical gradient analysis, identify the issues that GNNs designed to address Over-smoothing actually work on. We theoretically prove that the difficulty of training deep MLPs is the main challenge, and that various existing methods that supposedly tackle Over-smoothing actually improve the trainability of MLPs, which is the main reason for their performance gains. Our further investigation into trainability issues reveals that properly constrained, smaller upper bounds on the gradient flow notably enhance the trainability of GNNs. Experimental results on diverse datasets demonstrate consistency between our theoretical findings and empirical evidence. Our analysis provides new insights into constructing deep graph models.

Paper Structure

This paper contains 19 sections, 1 theorem, 19 equations, 7 figures, and 3 tables.

Key Result

Proposition 1

The node-wise gradient of GCN with respect to any learnable weight parameter ${\mathbf{W}}^{(\ell)}_k$, for $1\leq k\leq v$ and $1\leq\ell\leq N$, admits an upper bound; the proof is given in the paper's appendix.
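
As a generic illustration of how a gradient upper bound of this kind can arise, the sketch below backpropagates through a plain ReLU MLP and compares the norm of the gradient at the input with the product of the per-layer Frobenius norms. This is the standard sub-multiplicative argument, not the paper's Proposition 1. The function name `input_gradient_and_bound`, the width, the depths, and the Gaussian initialization are all assumptions made for illustration.

```python
import math
import random


def matvec(w, v):
    """Compute W v for a dense matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def mat_t_vec(w, v):
    """Compute W^T v for a dense matrix stored as a list of rows."""
    return [sum(w[i][j] * v[i] for i in range(len(w))) for j in range(len(w[0]))]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def frobenius(w):
    """Frobenius norm, which upper-bounds the spectral norm of W."""
    return math.sqrt(sum(x * x for row in w for x in row))


def input_gradient_and_bound(depth, width, scale, seed=0):
    """Return (gradient norm at the input, product-of-norms upper bound).

    Hypothetical setup: h_l = relu(W_l h_{l-1}) with Gaussian weights of
    standard deviation scale / sqrt(width), and a unit-norm upstream gradient.
    """
    rng = random.Random(seed)
    std = scale / math.sqrt(width)
    weights = [
        [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]

    # Forward pass: record which ReLU units are active in each layer.
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    masks = []
    for w in weights:
        z = matvec(w, h)
        masks.append([1.0 if zi > 0 else 0.0 for zi in z])
        h = [max(zi, 0.0) for zi in z]

    # Backward pass: g_{l-1} = W_l^T (mask_l * g_l), so
    # ||g_{l-1}|| <= ||W_l||_2 ||g_l|| <= ||W_l||_F ||g_l||.
    g = [1.0 / math.sqrt(width)] * width
    bound = norm(g)
    for w, m in zip(reversed(weights), reversed(masks)):
        g = mat_t_vec(w, [gi * mi for gi, mi in zip(g, m)])
        bound *= frobenius(w)
    return norm(g), bound


# Gradient norm versus its upper bound for a few (assumed) depths.
results = {d: input_gradient_and_bound(d, width=8, scale=1.0) for d in (4, 16, 64)}
for d, (grad_norm, bound) in results.items():
    print(f"depth={d:3d}  grad_norm={grad_norm:.3e}  bound={bound:.3e}")
```

The Frobenius-norm product is a loose bound, so the measured gradient norm can sit far below it. The sketch only shows that the inequality holds and that both quantities depend multiplicatively on depth.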

Figures (7)

  • Figure 1: Test accuracy ($\%$) (a) and Dirichlet Energy (b) on the node classification task for SGC ($K = 2, 3, 4, \ldots$) with an increasing number of MLP layers ($L = 0, 2, 8, 16, 32, 64$) on Cora.
  • Figure 2: Illustration of decoupled experiments (above). The decoupled experiments investigate the actual effectiveness of the tricks of GCNII, Batch Normalization, and DropEdge by adding them separately to the graph propagation process or the training process. Results of decoupled experiments (below). The contrast of color shades with vanilla SGC-MLP on the left reflects a decrease or increase in the accuracy ($\%$) of node classification tasks on Cora after using various tricks in each process.
  • Figure 3: The test accuracy ($\%$) of ResGCN, GCN, GCNII, and $\omega$GCN with increased layer depth on Cora.
  • Figure 4: Gradient flow variation and loss in 8-layer and 64-layer ResGCN (a), GCN (b), and GCNII (c) trained on Cora (top) and Citeseer (bottom). Blue and green lines denote the gradient flow of the 8-layer and 64-layer GNNs, respectively. Black lines denote the loss during training. The corresponding test accuracy ($\%$) is given in each subtitle.
  • Figure 5: Gradient flow variation and loss in 8-layer and 64-layer ResGCN, GCN, and GCNII trained on Pubmed.
  • ...and 2 more figures
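
The Dirichlet Energy tracked in Figure 1 can be sketched on a toy graph as follows. The sketch assumes the common definition $E(\mathbf{X}) = \mathrm{tr}(\mathbf{X}^\top(\mathbf{I}-\hat{\mathbf{A}})\mathbf{X})$ with $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}(\mathbf{A}+\mathbf{I})\tilde{\mathbf{D}}^{-1/2}$, which may differ from the paper's exact normalization, and models SGC propagation as $K$ repeated multiplications by $\hat{\mathbf{A}}$. The graph, the features, and all function names are hypothetical.

```python
import math


def normalized_adjacency(edges, n):
    """Dense A_hat = D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loops
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]  # degrees including self-loops
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def propagate(a_hat, x):
    """One parameter-free propagation step: X <- A_hat X."""
    n, d = len(x), len(x[0])
    return [
        [sum(a_hat[i][k] * x[k][c] for k in range(n)) for c in range(d)]
        for i in range(n)
    ]


def dirichlet_energy(a_hat, x):
    """tr(X^T (I - A_hat) X), computed entry-wise."""
    ax = propagate(a_hat, x)
    return sum(
        x[i][c] * (x[i][c] - ax[i][c])
        for i in range(len(x))
        for c in range(len(x[0]))
    )


# Hypothetical 4-node connected graph with 2-dimensional node features.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
a_hat = normalized_adjacency(edges, 4)

# Energy after K = 0, 1, ..., 8 propagation steps.
energies = []
h = features
for k in range(9):
    energies.append(dirichlet_energy(a_hat, h))
    h = propagate(a_hat, h)

for k, e in enumerate(energies):
    print(f"K={k}  energy={e:.6e}")
```

Under these assumptions the energy cannot increase with $K$, because every eigenvalue of $\hat{\mathbf{A}}$ has magnitude at most 1. This shrinking energy is the usual quantitative signature of Over-smoothing.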

Theorems & Definitions (1)

  • Proposition 1