On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning

Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst

TL;DR

This paper unifies two long-standing issues in graph neural networks, over-smoothing and over-squashing, by framing GNN training dynamics through vanishing gradients and viewing GNNs as recurrent, stateful systems. It introduces GNN-SSM, a state-space model that fixes $\Lambda$ to have eigenvalues of approximately unit modulus and uses a controllable input matrix $B$ to place the layer Jacobian at the edge of chaos, thereby mitigating gradient vanishing without extra trainable parameters. The authors theoretically link vanishing gradients to feature collapse via Lipschitz contraction and Dirichlet energy decay, and empirically validate their claims on standard benchmarks, showing that controlling the Jacobian spectrum yields deeper, more expressive GNNs. They further argue that overcoming over-squashing requires a combination of strong connectivity and non-dissipative dynamics, which they demonstrate by combining k-hop rewiring with the GNN-SSM backbone to improve long-range propagation and performance on long-range graph tasks.

Abstract

Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, this approach is well known to suffer from the over-smoothing and over-squashing phenomena, which result in representational collapse as the number of layers increases and insensitivity to the information contained at distant and poorly connected nodes, respectively. In this paper, we present a unified view of these problems through the lens of vanishing gradients, using ideas from linear control theory for our analysis. We propose an interpretation of GNNs as recurrent models and empirically demonstrate that a simple state-space formulation of a GNN effectively alleviates over-smoothing and over-squashing at no extra trainable parameter cost. Further, we show theoretically and empirically that (i) GNNs are by design prone to extreme gradient vanishing even after a few layers; (ii) Over-smoothing is directly related to the mechanism causing vanishing gradients; (iii) Over-squashing is most easily alleviated by a combination of graph rewiring and vanishing gradient mitigation. We believe our work will help bridge the gap between the recurrent and graph neural network literature and will unlock the design of new deep and performant GNNs.
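The state-space idea can be illustrated with a small numerical sketch. The update rule below is an assumption made for illustration, not the paper's stated equation: it takes $\Lambda = \lambda I$ and $B = \beta I$ with fixed scalars, uses a tanh GCN layer as the message-passing block, and shares one weight matrix across layers (the recurrent view). The names `gcn_layer`, `ssm_layer`, `lam`, and `beta` are hypothetical. The weight matrix is rescaled to spectral norm 0.5 so that the plain layer is a contraction.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, symmetric adjacency matrix."""
    adj = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]


def gcn_layer(h, a_tilde, w):
    """Plain nonlinear GCN layer: tanh(A_tilde H W)."""
    return np.tanh(a_tilde @ h @ w)


def ssm_layer(h, a_tilde, w, lam=1.0, beta=0.1):
    """Assumed state-space update: H <- lam * H + beta * GCN(H).

    lam and beta are fixed scalars (not trained), standing in for
    Lambda = lam * I and B = beta * I.
    """
    return lam * h + beta * gcn_layer(h, a_tilde, w)


rng = np.random.default_rng(0)
n, d, depth = 8, 4, 32

# Ring graph on n nodes.
ring = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
a_tilde = normalized_adjacency(ring)

# Shared weights, rescaled so the largest singular value is 0.5.
w = rng.normal(size=(d, d))
w *= 0.5 / np.linalg.norm(w, 2)

h0 = rng.normal(size=(n, d))
h_gcn, h_ssm = h0.copy(), h0.copy()
for _ in range(depth):
    h_gcn = gcn_layer(h_gcn, a_tilde, w)
    h_ssm = ssm_layer(h_ssm, a_tilde, w)

# Ratio of final to initial Frobenius norm of the node features.
gcn_ratio = np.linalg.norm(h_gcn) / np.linalg.norm(h0)
ssm_ratio = np.linalg.norm(h_ssm) / np.linalg.norm(h0)
print(f"plain GCN norm ratio after {depth} layers: {gcn_ratio:.3e}")
print(f"state-space norm ratio after {depth} layers: {ssm_ratio:.3e}")
```

Under these assumptions the plain layer shrinks the feature norm by at least a factor of 0.5 per layer, while the state-space update can shrink it by at most a factor of 0.95 per layer, so the features do not collapse at this depth. No trainable parameters are added, since `lam` and `beta` are constants.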

Paper Structure

This paper contains 48 sections, 15 theorems, 41 equations, 15 figures, 11 tables.

Key Result

Lemma 3.1

Let $\mathbf{H}^{(k)} = \tilde{\mathbf{A}}\,\mathbf{H}^{(k-1)}\,\mathbf{W}$ be a linear GCN layer, where $\tilde{\mathbf{A}}$ has eigenvalues $\{\lambda_1,\ldots,\lambda_n\}$ and $\mathbf{W}\,\mathbf{W}^T$ has eigenvalues $\{\mu_1,\ldots,\mu_{d_k}\}$. Consider the layer-wise Jacobian $\mathbf{J}$ ...
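The lemma statement is cut off above, so the following is only a numerical sketch of a standard linear-algebra fact about this kind of layer, not a reproduction of the lemma. It assumes column-major vectorization, a square weight matrix, and a symmetric normalized adjacency with self-loops. Under those assumptions, $\mathrm{vec}(\tilde{\mathbf{A}}\mathbf{H}\mathbf{W}) = (\mathbf{W}^T \otimes \tilde{\mathbf{A}})\,\mathrm{vec}(\mathbf{H})$, and the squared singular values of the Kronecker product are the pairwise products $\lambda_i^2 \mu_j$. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3  # number of nodes, feature width (square W assumed)

# Symmetric normalized adjacency with self-loops for a random undirected graph.
upper = np.triu(rng.random((n, n)) < 0.5, k=1).astype(float)
adj = upper + upper.T + np.eye(n)
deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
a_tilde = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]

w = rng.normal(size=(d, d)) / np.sqrt(d)

# Jacobian of vec(A_tilde H W) with respect to vec(H), column-major vec.
jac = np.kron(w.T, a_tilde)

# Check the vectorization identity on a random input.
h = rng.normal(size=(n, d))
lhs = (a_tilde @ h @ w).flatten(order="F")
rhs = jac @ h.flatten(order="F")

# Compare squared singular values of the Jacobian with lambda_i^2 * mu_j.
lam = np.linalg.eigvalsh(a_tilde)   # eigenvalues of A_tilde
mu = np.linalg.eigvalsh(w @ w.T)    # eigenvalues of W W^T
predicted = np.sort(np.outer(lam**2, mu).ravel())
actual = np.sort(np.linalg.svd(jac, compute_uv=False) ** 2)

print("vec identity holds:", np.allclose(lhs, rhs))
print("spectrum matches:", np.allclose(predicted, actual))
print("largest singular value of the Jacobian:", np.sqrt(actual[-1]))
```

Because the eigenvalues of the normalized adjacency lie in $(-1, 1]$, many of these products are well below the largest one whenever the weights are not large, which is one way to see why gradients through stacked layers of this form tend to shrink.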

Figures (15)

  • Figure 1: Latent evolution of 2-dimensional node features when passing through layers of a GNN-SSM with $\mathrm{eig}(\Lambda)\approx 1$. Node states evolve in a norm-preserving manner, without collapsing or contracting. The blue lines indicate how each node feature evolves across layers, i.e., as more layers are added. Each circle corresponds to a node’s 2D feature. Circles connected by a blue line represent the same node across successive layers. The color of each circle encodes the norm of the node feature, and the vector field indicates direction.
  • Figure 2: Left: Histogram of eigenvalue modulus of the Jacobian for linear, linear convolutional, and nonlinear convolutional layers. Middle: Vectorized Jacobian for GCN. Right: Vectorized Jacobian for GCN-SSM with $\mathrm{eig}(\Lambda)\approx1$, $\mathrm{eig}(B)\approx0.1$.
  • Figure 3: Experimental evaluation on Cora for an increasing number of layers. Left: Dirichlet Energy evolution for different $\|\Lambda\|_2$. Middle: 2-dimensional random feature projection evolution with a fixed point at zero. Right: Node classification performance.
  • Figure 4: Left: Evolution of Dirichlet Energy on the Cora dataset for GIN and Gated-GCN. Right: Histograms of eigenvalue spectra of layer-to-layer Jacobians for GIN and Gated-GCN.
  • Figure 5: Left: Performance on the RingTransfer task. Right: Effect of dissipativity.
  • ...and 10 more figures

Theorems & Definitions (25)

  • Lemma 3.1: Spectrum of layer-wise Jacobian's singular values
  • Theorem 3.2: Jacobian singular-value distribution
  • Proposition 3.3: Effect of state-space matrices
  • Lemma 4.1: Banach Fixed Point Theorem (Banach, 1922)
  • Lemma 4.2
  • Proposition 4.3: Convergence to a unique fixed point
  • Proposition 4.4: Contractions decrease Dirichlet energy
  • Theorem 5.1: Sensitivity bounds (Di Giovanni et al., 2023)
  • Definition A.1: Vectorization and Kronecker product
  • Definition A.2: Wishart matrix
  • ...and 15 more
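
The link between contraction and Dirichlet energy named in Proposition 4.4 can be checked numerically in one simple special case. The sketch below is not the paper's proof or experimental setup. It assumes the energy is defined as $\mathrm{tr}(\mathbf{H}^T(\mathbf{I}-\tilde{\mathbf{A}})\mathbf{H})$ with the symmetric normalized adjacency, and it uses purely linear propagation $\mathbf{H} \leftarrow \tilde{\mathbf{A}}\mathbf{H}$ on a path graph. The function name `dirichlet_energy` is illustrative.

```python
import numpy as np


def dirichlet_energy(h, a_tilde):
    """Energy tr(H^T (I - A_tilde) H) under the assumed normalized Laplacian."""
    lap = np.eye(a_tilde.shape[0]) - a_tilde
    return float(np.trace(h.T @ lap @ h))


rng = np.random.default_rng(0)
n, d, depth = 10, 3, 20

# Path graph with self-loops, symmetrically normalized.
path = np.diag(np.ones(n - 1), k=1)
adj = path + path.T + np.eye(n)
deg = adj.sum(axis=1)
a_tilde = adj / np.sqrt(np.outer(deg, deg))

h = rng.normal(size=(n, d))
energies = []
for _ in range(depth):
    energies.append(dirichlet_energy(h, a_tilde))
    h = a_tilde @ h  # linear message passing, no weights or nonlinearity

for layer in (0, 1, 5, 10, depth - 1):
    print(f"layer {layer:2d}: Dirichlet energy = {energies[layer]:.4e}")
```

In this setting the energy cannot increase from one layer to the next, because each eigencomponent of the features is scaled by an eigenvalue of $\tilde{\mathbf{A}}$ whose magnitude is at most one. Adding weights and nonlinearities changes the details, which is what the paper's contraction-based results address.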