Graph Neural Networks Do Not Always Oversmooth

Bastian Epping; Alexandre René; Moritz Helias; Michael T. Schaub

TL;DR

This paper identifies a non-oversmoothing phase for Graph Convolutional Networks (GCNs) by leveraging their Gaussian-process (GP) equivalence in the limit of infinitely many hidden features. By linearizing the GP dynamics around the fixed point and analyzing its eigendirections, it shows that deep GCNs can avoid oversmoothing if the weight variance $\sigma_w^2$ is large enough, with a transition at $\max_i|\lambda_i^{\mathrm{p}}|=1$ at which the propagation depth $\xi_i$ diverges. The authors verify the predictions on toy complete graphs and a Contextual Stochastic Block Model (CSBM), and demonstrate that networks near the transition and in the chaotic phase remain informative at large depth, including on the Cora dataset, where GP-based results match established benchmarks for hundreds of layers. The findings offer a principled initialization strategy for building exceptionally deep GCNs and provide insight into how graph topology is encoded in the equilibrium GP state, potentially guiding future GNN design and training.
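
To make the link between the transition condition and the diverging propagation depth explicit, here is a brief sketch. It assumes (this summary does not spell it out) that a small deviation from the fixed point along eigendirection $i$ of the linearized GP dynamics is multiplied by the eigenvalue $\lambda_i^{\mathrm{p}}$ at every layer; the symbol $\Delta_i^{(l)}$ for that deviation at layer $l$ is introduced here only for illustration.

$$
\Delta_i^{(l)} \approx \left(\lambda_i^{\mathrm{p}}\right)^{l}\,\Delta_i^{(0)}
\quad\Longrightarrow\quad
\left|\Delta_i^{(l)}\right| \approx e^{-l/\xi_i}\left|\Delta_i^{(0)}\right|,
\qquad
\xi_i = -\frac{1}{\ln\left|\lambda_i^{\mathrm{p}}\right|}.
$$

Under this reading, $\xi_i \to \infty$ as $|\lambda_i^{\mathrm{p}}| \to 1$ from below, which is consistent with the statement that the propagation depth diverges at $\max_i|\lambda_i^{\mathrm{p}}|=1$.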

Abstract

Graph neural networks (GNNs) have emerged as powerful tools for processing relational data in applications. However, GNNs suffer from the problem of oversmoothing, the property that the features of all nodes exponentially converge to the same vector over layers, prohibiting the design of deep GNNs. In this work we study oversmoothing in graph convolutional networks (GCNs) by using their Gaussian process (GP) equivalence in the limit of infinitely many hidden features. By generalizing methods from conventional deep neural networks (DNNs), we can describe the distribution of features at the output layer of deep GCNs in terms of a GP: as expected, we find that typical parameter choices from the literature lead to oversmoothing. The theory, however, allows us to identify a new, non-oversmoothing phase: if the initial weights of the network have sufficiently large variance, GCNs do not oversmooth, and node features remain informative even at large depth. We demonstrate the validity of this prediction in finite-size GCNs by training a linear classifier on their output. Moreover, using the linearization of the GCN GP, we generalize the concept of propagation depth of information from DNNs to GCNs. This propagation depth diverges at the transition between the oversmoothing and non-oversmoothing phase. We test the predictions of our approach and find good agreement with finite-size GCNs. Initializing GCNs near the transition to the non-oversmoothing phase, we obtain networks which are both deep and expressive.

Paper Structure (21 sections, 36 equations, 7 figures)


Introduction
Related Work
Background
Network architecture
Gaussian process equivalence of GCNs
Feature distance
Results
Propagation depths
The non-oversmoothing phase of GCNs
Complete graph
General graphs
Implications for performance
Discussion
Analytical solution for expectation values
The linearized GP of GCNs
...and 6 more sections

Figures (7)

Figure 1: Simulations and GP prior of a GCN on a complete graph with $N=5$ nodes, shift operator $A_{\alpha\beta}=\frac{g}{N-1}+\delta_{\alpha\beta}(1-\frac{Ng}{N-1})$, vanishing bias $\sigma_{b}^{2}=0$ and $\phi(x)=\mathrm{erf}(\frac{\sqrt{\pi}}{2}x)$. $\textbf{a)}$ The phase diagram dependent on $\sigma_{w}^{2}$ and $g$. The equilibrium feature distance $\mu(\boldsymbol{X})$ obtained from computing the GCN GP prior for $L=4,000$ layers is shown as a heatmap, the red line is the theoretical prediction for the transition to the non-oversmoothing phase. $\textbf{b)}$ Same as in a) but color coding shows whether $\mu(\boldsymbol{X})$ is close to zero (black) or not (white) with precision $10^{-5}$. The red line again shows the theoretically predicted phase transition. $\textbf{c)}$ Feature distance $\mu(\boldsymbol{X}^{(l)})$ for a random input $X_{\alpha i}^{(0)}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$ as a function of layer $l$. Parameters are written in the panel in matching colors and marked with color coded crosses in the phase diagram in panel b). Feature dimension of the hidden layers is $d_{l}=200$, crosses show the mean of $50$ network realizations, solid curves the theoretical predictions.
Figure 2: The non-oversmoothing phase in a contextual stochastic block model instance with parameters $N=100$, $d=5$, $\lambda=1$. The shift operator is chosen according to (\ref{eq:shift-operator-definition}) with $g=0.3$, and $\sigma_{b}^{2}=0$ and $\phi(x)=\mathrm{erf}(\frac{\sqrt{\pi}}{2}x)$. $\textbf{a)}$ The maximum feature distance between any pair of nodes in equilibrium obtained from computing the GCN GP prior for $L=4,000$ layers (blue) and the largest eigenvalue of the linearized GCN GP dynamics at the zero distance state as a function of weight variance $\sigma_{w}^{2}$. The red line marks the point where $\max_{i}\{|\lambda_{i}^{\mathrm{p}}|\}=1$. $\textbf{b)}$ Heatmap of the equilibrium distance matrix with entries $d_{\alpha\beta}=d(\boldsymbol{x}_{\alpha},\boldsymbol{x}_{\beta})$ (Equation (\ref{eq:2-distance-definition})) at $\sigma_{w}^{2}=1.3$, marked as point $A$ in panel a). Colorbar shared with the plot in c). $\textbf{c)}$ Same as b) but at point $B$ with $\sigma_{w}^{2}=2$. $\textbf{d)}$ Feature distances $d_{\alpha\beta}^{(l)}=d(\boldsymbol{x}_{\alpha}^{(l)},\boldsymbol{x}_{\beta}^{(l)})$ as a function of layers for random inputs $X_{\alpha i}^{(0)}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$ and a finite-size GCN with $d_{l}=200$, averaged over pairs of nodes within the same community (red) and across communities (purple).
Figure 3: Generalization error (mean squared error) of the Gaussian process for a CSBM with parameters $N=20$, $d=5$, $\lambda=1$, $\gamma=1$ and $\mu=4$. The shift operator is defined in (\ref{eq:shift-operator-definition}) with $g=0.1$, other parameters are $\sigma_{b}^{2}=0$, $\phi(x)=\mathrm{erf}(\frac{\sqrt{\pi}}{2}x)$ and $\sigma_{ro}=0.01$. In all panels we use $N^{\mathrm{train}}=10$ training nodes and $N^{\mathrm{test}}=10$ test nodes, five training nodes from each of the two communities. Labels are $\pm1$ for the two communities, respectively. For all panels, we show averages over $50$ CSBM instances. $\textbf{a)}$ Heatmap of the generalization error of the GCN GP dependent on number of layers $L$ and weight variance $\sigma_{w}^{2}$. The red line shows the transition to the non-oversmoothing phase. $\textbf{b)}$ Generalization error dependent on weight variance $\sigma_{w}^{2}$ and depths $L=1,4,16,64,256,1024$ from turquoise to dark blue. $\textbf{c)}$ Generalization error dependent on the layer for the GCN GP at the critical line $\sigma_{w}^{2}=\sigma_{w,\mathrm{crit}}^{2}$, in the oversmoothing phase $\sigma_{w}^{2}=\sigma_{w,\mathrm{crit}}^{2}-1$ and the non-oversmoothing phase $\sigma_{w}^{2}=\sigma_{w,\mathrm{crit}}^{2}+1$. $\textbf{d)}$ Performance of randomly initialized finite-size GCNs with $d_{l}=200$ for $l=1,\dots,L$ where only the linear readout layer is trained with gradient descent (details in Appendix \ref{app:Numerical-experiments}) at the critical line $\sigma_{w}^{2}=\sigma_{w,\mathrm{crit}}^{2}$, in the oversmoothing phase $\sigma_{w}^{2}=\sigma_{w,\mathrm{crit}}^{2}-1$ and the non-oversmoothing phase $\sigma_{w}^{2}=\sigma_{w,\mathrm{crit}}^{2}+1$.
Figure 4: GCN GP performance on the Cora dataset (Sen et al., 2008). $\textbf{a)}$ Generalization error (mean squared error) as a function of layers $L$ and weight variance $\sigma_{w}^{2}-\sigma_{w,\text{crit.}}^{2}$ for our stochastic shift operator (\ref{eq:shift-operator-definition}) with $g=0.9$. The value of $\sigma_{w,\text{crit.}}^{2}\approx1$ is determined numerically in Appendix \ref{app:Non-oversmoothing-transition-cora}. $\textbf{b)}$ Layer dependent generalization error and accuracy for GCNs near the transition $\sigma_{w}^{2}=\sigma_{w,\text{crit.}}^{2}+0.1$. Grey dashed line shows the accuracy obtained for GCNs in the original work (Kipf and Welling, 2017). Numerical details in Appendix \ref{app:Numerical-experiments}.
Figure 5: Histogram of $\sigma_{w,\mathrm{crit}}^{2}$ for the $50$ CSBM instances used in the experiment of Figure 3. The point $1$ is marked for comparison to related work.
...and 2 more figures
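
The finite-size experiment described in the caption of Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the layer update `X <- phi(A X W)` with weights drawn as $W_{ij}\sim\mathcal{N}(0,\sigma_w^2/d_{l-1})$, and the feature distance taken as the mean over all ordered node pairs of the squared Euclidean distance divided by the feature dimension, are assumptions, since the summary does not define either. The shift operator, the activation $\phi(x)=\mathrm{erf}(\frac{\sqrt{\pi}}{2}x)$, zero bias, $N=5$ and $d_l=200$ are taken from the caption. All function names are hypothetical.

```python
import math
import numpy as np

# Element-wise erf without SciPy; adequate for the small arrays used here.
_erf = np.vectorize(math.erf)


def complete_graph_shift(n, g):
    """Shift operator from the Figure 1 caption:
    A_ab = g/(n-1) + delta_ab * (1 - n*g/(n-1))."""
    return np.full((n, n), g / (n - 1)) + np.eye(n) * (1.0 - n * g / (n - 1))


def phi(x):
    """Activation from the Figure 1 caption: erf(sqrt(pi)/2 * x)."""
    return _erf(math.sqrt(math.pi) / 2.0 * x)


def feature_distance(x):
    """ASSUMED definition: mean over all ordered node pairs (a, b) of
    ||x_a - x_b||^2 / d, where d is the feature dimension."""
    _, d = x.shape
    diff = x[:, None, :] - x[None, :, :]
    return float((diff ** 2).sum(axis=-1).mean() / d)


def simulate(n=5, g=0.3, sigma_w2=1.0, width=200, depth=30, seed=0):
    """Propagate i.i.d. N(0, 1) inputs through a random GCN and record
    the feature distance after every layer.

    ASSUMED layer update (zero bias): X <- phi(A @ X @ W),
    with W_ij ~ N(0, sigma_w2 / fan_in).
    """
    rng = np.random.default_rng(seed)
    a = complete_graph_shift(n, g)
    x = rng.standard_normal((n, width))
    distances = [feature_distance(x)]
    for _ in range(depth):
        fan_in = x.shape[1]
        w = rng.standard_normal((fan_in, width)) * math.sqrt(sigma_w2 / fan_in)
        x = phi(a @ x @ w)
        distances.append(feature_distance(x))
    return distances


if __name__ == "__main__":
    # Compare a small and a large weight variance at the same g.
    for s2 in (0.5, 3.0):
        d = simulate(sigma_w2=s2)
        print(f"sigma_w^2={s2}: distance at layer 0 = {d[0]:.3f}, "
              f"at layer {len(d) - 1} = {d[-1]:.3e}")
```

Running the script for several values of `sigma_w2` and `g` gives curves that can be compared qualitatively with panel c) of Figure 1. Averaging over several seeds, as the caption does with 50 network realizations, would reduce the finite-width noise.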
