Residual connections provably mitigate oversmoothing in graph neural networks
Ziang Chen, Zhengjiang Lin, Shi Chen, Yury Polyanskiy, Philippe Rigollet
TL;DR
The paper tackles oversmoothing in deep graph neural networks by introducing a rigorous framework, based on the multiplicative ergodic theorem (MET), for studying the asymptotic separation of vertex features. It defines a normalized vertex similarity measure $\mu(x)$ and derives explicit asymptotic rates for GNNs with and without residual connections under broad spectral and distributional assumptions, including non-symmetric $P$ and i.i.d. weight ensembles. The two main theorems show that (i) in non-residual GNNs, $\mu(x^{(t)})$ decays exponentially at a rate given by the second-largest eigenvalue of $P$, and (ii) residual GNNs admit a computable lower bound on the corresponding rate that is often strictly larger and can even equal 1, meaning oversmoothing is mitigated or avoided altogether. Special cases (deterministic, Ginibre, bounded-norm, simultaneously diagonalizable) are treated explicitly. Numerical experiments on standard citation graphs support the theory: residual connections preserve vertex distinctiveness and improve the performance of deep models, which offers practical guidance for designing deeper GNNs with provable resilience to oversmoothing.
Abstract
Graph neural networks (GNNs) have achieved remarkable empirical success in processing and representing graph-structured data across various domains. However, a significant challenge known as "oversmoothing" persists, where vertex features become nearly indistinguishable in deep GNNs, severely restricting their expressive power and practical utility. In this work, we analyze the asymptotic oversmoothing rates of deep GNNs with and without residual connections by deriving explicit convergence rates for a normalized vertex similarity measure. Our analytical framework is grounded in the multiplicative ergodic theorem. Furthermore, we demonstrate that adding residual connections effectively mitigates or prevents oversmoothing across several broad families of parameter distributions. The theoretical findings are strongly supported by numerical experiments.
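The contrast between the two regimes can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the stand-in measure `mu` (Frobenius norm of the vertex-centered features divided by the Frobenius norm of the features, which may differ from the paper's $\mu(x)$), the linear updates $x \leftarrow P x W$ and $x \leftarrow x + \alpha P x W$, the 6-cycle with self-loops used to build $P$, the step size `alpha = 0.1`, and the random orthogonal weights (one bounded-norm choice, picked so that the norms are easy to reason about).

```python
import numpy as np


def mu(x):
    """Stand-in similarity measure (an assumption, not the paper's definition):
    norm of the vertex-centered features over the norm of the features.
    It equals 0 when every vertex carries the same feature vector."""
    centered = x - x.mean(axis=0, keepdims=True)
    return np.linalg.norm(centered) / np.linalg.norm(x)


def cycle_propagation_matrix(n):
    """P = (A + I) / 3 for an n-cycle: symmetric and row-stochastic."""
    eye = np.eye(n)
    adj = np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)
    return (adj + eye) / 3.0


def random_orthogonal(d, rng):
    """Orthogonal d x d matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q


def run(depth, residual, alpha=0.1, n=6, d=4, seed=0):
    """Track mu across layers for a toy linear GNN (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    p = cycle_propagation_matrix(n)
    x = np.arange(n * d, dtype=float).reshape(n, d)  # fixed initial features
    history = [mu(x)]
    for _ in range(depth):
        w = random_orthogonal(d, rng)
        update = p @ x @ w
        x = x + alpha * update if residual else update
        x = x / np.linalg.norm(x)  # rescale only; mu is scale-invariant
        history.append(mu(x))
    return np.array(history)


plain = run(depth=32, residual=False)
resid = run(depth=32, residual=True)
print(f"mu after 32 layers, no residual: {plain[-1]:.3e}")
print(f"mu after 32 layers, residual:    {resid[-1]:.3e}")
```

In this toy setting the second-largest eigenvalue of $P$ is $2/3$, so without the residual term the stand-in measure shrinks by roughly that factor per layer, while the residual update keeps it orders of magnitude larger at the same depth. The sketch only illustrates the qualitative gap; it does not reproduce the paper's experiments or its exact rates.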
