Solving Oversmoothing in GNNs via Nonlocal Message Passing: Algebraic Smoothing and Depth Scalability

Weiqi Guan, Junlin He

TL;DR

The paper investigates how Layer Normalization placement (Pre-LN vs Post-LN) affects oversmoothing and the curse of depth in attention-based GNNs. It develops a nonlocal message passing scheme with Post-LN that induces algebraic smoothing, providing formal guarantees and avoiding both oversmoothing and depth-related optimization issues. The approach adds no parameters and scales to 256 layers, with empirical validation across five benchmarks showing improved depth utilization and competitive accuracy. The work connects diffusion dynamics to practical deep graph learning and highlights potential implications for Post-LN configurations in other domains such as LLMs.

Abstract

The relationship between Layer Normalization (LN) placement and the oversmoothing phenomenon remains underexplored. We identify a critical dilemma: Pre-LN architectures avoid oversmoothing but suffer from the curse of depth, while Post-LN architectures bypass the curse of depth but experience oversmoothing. To resolve this, we propose a new method based on Post-LN that induces algebraic smoothing, preventing oversmoothing without the curse of depth. Empirical results across five benchmarks demonstrate that our approach supports deeper networks (up to 256 layers) and improves performance, requiring no additional parameters.

Key contributions:

  • Theoretical Characterization: Analysis of LN dynamics and their impact on oversmoothing and the curse of depth.
  • A Principled Solution: A parameter-efficient method that induces algebraic smoothing and avoids oversmoothing and the curse of depth.
  • Empirical Validation: Extensive experiments showing the effectiveness of the method in deeper GNNs.
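To make the placement distinction in the abstract concrete, the sketch below contrasts where normalization sits in a Pre-LN versus a Post-LN residual layer. It is only an illustration under assumptions: `aggregate` is a hypothetical mean-over-neighbors stand-in for an attention-based message passing sublayer, `layer_norm` has no learned scale or shift, and none of this reproduces the paper's nonlocal scheme.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each node's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def aggregate(adj, x):
    """Hypothetical sublayer: average features over each node and its neighbors."""
    a = adj + np.eye(adj.shape[0])  # add self-loops
    return (a / a.sum(axis=1, keepdims=True)) @ x

def pre_ln_layer(adj, x):
    # Pre-LN: normalize only the sublayer input; the residual stream stays unnormalized.
    return x + aggregate(adj, layer_norm(x))

def post_ln_layer(adj, x):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm(x + aggregate(adj, x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Path graph on three nodes, chosen only for illustration.
    adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    h_pre = h_post = rng.normal(size=(3, 8))
    for _ in range(16):
        h_pre = pre_ln_layer(adj, h_pre)
        h_post = post_ln_layer(adj, h_post)
    print("Pre-LN feature norm after 16 layers: ", np.linalg.norm(h_pre))
    print("Post-LN feature norm after 16 layers:", np.linalg.norm(h_post))
```

In the Pre-LN variant the residual stream is never rescaled, so its magnitude can accumulate with depth; in the Post-LN variant every layer output is renormalized. The toy does not by itself demonstrate oversmoothing or the curse of depth; it only shows where the normalization is applied.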

Paper Structure

This paper contains 24 sections, 4 theorems, 53 equations, 4 figures, and 18 tables.

Key Result

Lemma 2.1

Let $G = (V,E,\omega,\mu)$ be a weighted graph. Then for any function $X$ on the graph and all $m_1,m_2\in\mathbb{N}$, there exist universal constants $C_1, C_2 > 0$ such that

Figures (4)

  • Figure 1: Log-log plots of Laplacian energy evolution at initialization on Cora. The panels display (a) the Post-LN SAN, (b) our proposed nonlocal SAN with Post-LN, and (c) the Pre-LN SAN architecture.
  • Figure 2: Inter-layer cosine similarity of representations after training on Cora. (a) Our proposed nonlocal SAN with Post-LN. (b) The Pre-LN SAN.
  • Figure 3: (a) The Post-LN model. (b) Our proposed nonlocal model with Post-LN.
  • Figure 4: Evolution of Laplacian energy at initialization.
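The Laplacian energy tracked in Figures 1 and 4 can be illustrated with a small sketch. The definition used here, $E(X) = \frac{1}{2}\sum_{i,j} w_{ij}\lVert x_i - x_j\rVert^2 = \operatorname{tr}(X^\top (D - W) X)$ for symmetric weights $W$, is the standard unnormalized Dirichlet energy and is an assumption; the paper's definition may use a different normalization or vertex measure. The function name `laplacian_energy` and the mean-aggregation smoothing step are likewise hypothetical.

```python
import numpy as np

def laplacian_energy(weights, x):
    """Unnormalized Dirichlet energy trace(X^T (D - W) X) for symmetric weights W."""
    lap = np.diag(weights.sum(axis=1)) - weights  # combinatorial Laplacian L = D - W
    return float(np.trace(x.T @ lap @ x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 6
    # Path graph with unit edge weights, chosen only for illustration.
    weights = np.zeros((n, n))
    for i in range(n - 1):
        weights[i, i + 1] = weights[i + 1, i] = 1.0
    x = rng.normal(size=(n, 4))

    # Hypothetical smoothing step: average over each node and its neighbors.
    a = weights + np.eye(n)
    p = a / a.sum(axis=1, keepdims=True)

    for depth in range(1, 9):
        x = p @ x
        print(depth, laplacian_energy(weights, x))
```

Plotting the printed energy against depth on log-log axes gives the kind of curve the figure captions describe: a straight line would correspond to algebraic (power-law) decay, while a curve bending steeply downward would indicate faster, exponential-type decay.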

Theorems & Definitions (7)

  • Definition 2.1
  • Lemma 2.1
  • Definition 2.2: The Curse of Depth
  • Theorem 4.1
  • Theorem 5.1
  • Lemma A.1
  • proof