Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks

Dimitrios Kelesis, Dimitris Fotakis, Georgios Paliouras

TL;DR

The paper addresses oversmoothing in deep graph neural networks by introducing MASED, a Mean Average Squared Euclidean Distance that quantifies embedding dispersion across layers. It derives layer-wise and network-wide bounds linking MASED to embedding norms and the singular values of weight matrices, providing a principled explanation for depth-related collapse and suggesting targeted remedies. The authors propose G-Reg to increase the smallest singular value of weight matrices and advocate decoupling the number of trainable weight matrices from the total graph-hop count to reduce redundancy and oversmoothing. Empirical results across seven datasets show that MASED correlates with performance and that G-Reg, along with decoupled-hop architectures, enables deeper, more robust GNNs, including in cold-start scenarios. This work offers a concrete, quantitative framework for understanding and mitigating oversmoothing with practical guidelines for model design and regularization.
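To make the idea of embedding dispersion concrete, here is a minimal NumPy sketch of a MASED-style quantity. It assumes the metric is the squared Euclidean distance averaged over all ordered pairs of distinct node embeddings; the paper's exact normalization may differ, and the function name `mased` is chosen here for illustration only.

```python
import numpy as np


def mased(X: np.ndarray) -> float:
    """Mean squared Euclidean distance over ordered pairs of distinct rows.

    Assumed definition for illustration; the paper's normalization may differ.
    X has shape (num_nodes, embedding_dim).
    """
    n = X.shape[0]
    if n < 2:
        return 0.0
    sq_norms = np.sum(X * X, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)   # clip tiny negatives caused by round-off
    np.fill_diagonal(d2, 0.0)  # a node has zero distance to itself
    return float(d2.sum() / (n * (n - 1)))
```

Under this definition, a value near zero means all embeddings have nearly collapsed to a single point, which is the behaviour associated with oversmoothing.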

Abstract

In this paper, we study the factors that contribute to the effect of oversmoothing in deep Graph Neural Networks (GNNs). Specifically, our analysis is based on a new metric (Mean Average Squared Euclidean Distance, $MASED$) to quantify the extent of oversmoothing. We derive layer-wise bounds on $MASED$, which aggregate to yield global upper and lower distance bounds. Based on this quantification of oversmoothing, we further analyze the importance of two different properties of the model, namely the norms of the generated node embeddings and the largest and smallest singular values of the weight matrices. Building on the insights drawn from the theoretical analysis, we show that oversmoothing increases as the number of trainable weight matrices and the number of adjacency matrices increase. We also use the derived layer-wise bounds on $MASED$ to form a proposal for decoupling the number of hops (i.e., adjacency depth) from the number of weight matrices. In particular, we introduce G-Reg, a regularization scheme that increases the bounds, and demonstrate through extensive experiments that doing so increases node classification accuracy and achieves robustness at large depths. We further show that by reducing oversmoothing in deep networks, we can achieve better results in some tasks than with shallow ones. Specifically, we experiment with a "cold start" scenario, i.e., when there is no feature information for the unlabeled nodes. Finally, we show empirically the trade-off between receptive field size (i.e., number of weight matrices) and performance, using the $MASED$ bounds. This is achieved by distributing adjacency hops across a small number of trainable layers, avoiding the extremes of under- or over-parameterization of the GNN.
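The abstract does not spell out how G-Reg is formulated, so the following is only a hypothetical sketch of the kind of quantity such a regularizer could act on: the smallest singular value of each weight matrix. The names `smallest_sv_penalty` and `coeff` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def smallest_singular_value(W: np.ndarray) -> float:
    """Smallest singular value of W (np.linalg.svd returns them in descending order)."""
    return float(np.linalg.svd(W, compute_uv=False)[-1])


def smallest_sv_penalty(weights, coeff: float = 0.01) -> float:
    """Hypothetical penalty term, not the paper's G-Reg definition.

    It becomes more negative as the smallest singular values grow, so adding
    it to a loss would reward weight matrices that avoid collapsing directions.
    """
    return -coeff * sum(smallest_singular_value(W) for W in weights)
```

In an actual training loop this term would have to be computed in an autodiff framework so that gradients can flow through the singular values; the NumPy version above only evaluates the quantity.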

Paper Structure

This paper contains 28 sections, 3 theorems, 39 equations, 27 figures, 3 tables.

Key Result

Theorem 1

Let $s_l=\prod\limits_{h=1}^{L}{s_{lh}}$, where $s_{lh}$ is the largest singular value of weight matrix $W_{lh}$, and $s = \sup_{l\in \mathbb{N}^+} s_{l}$. Then the distance from the oversmoothing subspace $M$ is measured as follows: $d_M(X^{(l)}) = O((s\lambda)^l)$, where $l$ is the layer number, $\lambda$ is th
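The quantities in Theorem 1 can be evaluated numerically with a short sketch. The statement of $\lambda$ is cut off in this excerpt, so the code below treats it as a given scalar input `lam` without assuming its definition; the supremum is taken over the finitely many layers supplied, and all function names are illustrative.

```python
import numpy as np


def largest_singular_value(W: np.ndarray) -> float:
    """Largest singular value of W (first entry of the descending list)."""
    return float(np.linalg.svd(W, compute_uv=False)[0])


def layer_scale(layer_weights) -> float:
    """s_l: product of the largest singular values of the matrices in layer l."""
    return float(np.prod([largest_singular_value(W) for W in layer_weights]))


def decay_envelope(all_layers, lam: float, depth: int):
    """Return [(s * lam) ** l for l = 1..depth], with s the max of s_l.

    `lam` is supplied by the caller; its definition is truncated in the text.
    """
    s = max(layer_scale(ws) for ws in all_layers)
    return [(s * lam) ** l for l in range(1, depth + 1)]
```

When the product `s * lam` is below 1, the envelope shrinks geometrically with depth, which is the regime in which the bound forces embeddings toward the subspace $M$.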

Figures (27)

  • Figure 1: Each SGC layer uses the adjacency matrix raised to the power of $L/K$, and takes as input either the output of the previous layer or the initial node features if it is the first layer.
  • Figure 2: Epoch evolution of the Mean Average Squared Euclidean Distance ($MASED$) value of the embeddings of all nodes and training nodes separately. We show results for 3 different depths of a GCN model, illustrating how $MASED$ changes in the first, the middle and the last layer of the model. We also include the accuracy achieved by each model.
  • Figure 3: Epoch evolution of the average value of the norms of the embeddings of all nodes and of the training nodes separately. We show results for 3 different depths of a GCN model and average norm values in different layers within the model. We show how norms evolve in the first, the middle and the last layer of each model. We also include the accuracy achieved by each model.
  • Figure 4: Epoch evolution of the average value of the angles between the class centroids of the embeddings of the training nodes. We show results for 3 different depths of a GCN model and average angle values in different layers within the model. We show how angles evolve in the first, the middle and the last layer of each model. We also include the accuracy achieved by each model.
  • Figure 5: GCN with and without the proposed G-Reg regularization across 7 datasets for varying depth. We include results for different values of $\lambda_w$.
  • ...and 22 more figures
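
The decoupled architecture described in the Figure 1 caption can be sketched as follows, with $K$ trainable layers sharing $L$ total hops so that each layer propagates with the adjacency matrix raised to the power $L/K$. This is a minimal sketch under several assumptions not confirmed by the excerpt: the adjacency is symmetrically normalized with self-loops, $L$ is divisible by $K$, and no nonlinearity is applied between layers. The function names are illustrative.

```python
import numpy as np


def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # degrees >= 1 due to self-loops
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def decoupled_forward(A_norm, X, weights, total_hops: int) -> np.ndarray:
    """Apply K = len(weights) linear layers, each using A_norm ** (L / K)."""
    K = len(weights)
    if K == 0 or total_hops % K != 0:
        raise ValueError("total_hops must be a positive multiple of the layer count")
    P = np.linalg.matrix_power(A_norm, total_hops // K)  # hops per layer
    H = X  # the first layer consumes the initial node features
    for W in weights:
        H = P @ H @ W  # propagate over L/K hops, then apply the layer's weights
    return H
```

With this layout the receptive field is set by `total_hops`, while the number of trainable matrices is set independently by `len(weights)`.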

Theorems & Definitions (3)

  • Theorem 1: Suzuki
  • Lemma 2
  • Lemma 3