Feature Distribution on Graph Topology Mediates the Effect of Graph Convolution: Homophily Perspective

Soo Yong Lee, Sunwoo Kim, Fanchen Bu, Jaemin Yoo, Jiliang Tang, Kijung Shin

TL;DR

This work reveals that the dependence between graph topology and node features, quantified by Class-controlled Feature Homophily (CFH), mediates the impact of graph convolution on GNN performance. It introduces CSBM-X, a contextual stochastic block model with a tunable $A\text{-}X$ dependence strength parameter $\tau$, to precisely control CFH while holding feature distance and class-homophily fixed. Theoretical results show that the Bayes error after graph convolution is minimized when CFH is zero (i.e., $\tau=0$), and empirical studies on synthetic CSBM-X graphs corroborate this, with real-world data showing that reducing $A\text{-}X$ dependence via feature shuffles improves GNN accuracy, particularly in high-homophily graphs. The findings suggest that small CFH is beneficial for node classification, offering a new lens on GNN design and evaluation, and highlight potential directions for tailoring datasets and architectures to leverage or resist topology-feature coupling. Overall, CFH provides a principled predictor of when graph convolution will be advantageous and how to modulate its effects in practice.

Abstract

How would randomly shuffling feature vectors among nodes from the same class affect graph neural networks (GNNs)? The feature shuffle, intuitively, perturbs the dependence between graph topology and features (A-X dependence) for GNNs to learn from. Surprisingly, we observe a consistent and significant improvement in GNN performance following the feature shuffle. Having overlooked the impact of A-X dependence on GNNs, the prior literature does not provide a satisfactory understanding of the phenomenon. Thus, we raise two research questions. First, how should A-X dependence be measured, while controlling for potential confounds? Second, how does A-X dependence affect GNNs? In response, we (i) propose a principled measure for A-X dependence, (ii) design a random graph model that controls A-X dependence, (iii) establish a theory on how A-X dependence relates to graph convolution, and (iv) present empirical analysis on real-world graphs that aligns with the theory. We conclude that A-X dependence mediates the effect of graph convolution, such that smaller dependence improves GNN-based node classification.
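
The feature shuffle described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `shuffle_features_within_class`, its arguments, and the `seed` parameter are hypothetical, and the sketch assumes a dense feature matrix `X` of shape (nodes, features) and an integer label vector `y`. Within each class, feature rows are permuted among the nodes of that class, so the per-class feature distribution is unchanged while the assignment of features to positions in the graph is randomized.

```python
import numpy as np


def shuffle_features_within_class(X, y, seed=0):
    """Permute feature rows among nodes that share a class label.

    Hypothetical helper for illustration only.
    X: (n_nodes, n_features) array; y: (n_nodes,) array of class labels.
    Returns a new array; X itself is left untouched.
    """
    rng = np.random.default_rng(seed)
    X_shuffled = X.copy()
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)  # nodes belonging to class c
        # Each node of class c receives the features of a random
        # node of the same class (a permutation, so no row is lost).
        X_shuffled[idx] = X[rng.permutation(idx)]
    return X_shuffled


# Toy usage: 6 nodes, 2 features, 2 classes.
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
X_new = shuffle_features_within_class(X, y, seed=1)
```

Because the adjacency matrix and the labels are not modified, class-homophily and the class-wise feature statistics stay the same; only the pairing between features and graph positions within each class changes.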

Paper Structure

This paper contains 42 sections, 6 theorems, 44 equations, 17 figures, 3 tables.

Key Result

Lemma 3.1

(Boundedness) $\mathbf{\Tilde{h}}^{(G)}, \mathbf{\Tilde{h}}^{(v)}_i \in [-1,1]$, and the bound is tight, i.e., $\inf_{G}\mathbf{\Tilde{h}}^{(G)} = -1$ and $\sup_{G}\mathbf{\Tilde{h}}^{(G)} = 1$.

Figures (17)

  • Figure 1: An Intriguing Phenomenon. GCN performance increases significantly over the feature shuffle, while the performance of MLP and label propagation remains stationary.
  • Figure 2: Benchmark Graph Statistics. Graph-level CFH scores $\mathbf{\Tilde{h}}^{(G)}$ (i) are generally positive and small, with (ii) low correlation to class-homophily $\mathbf{h}_c$.
  • Figure 3: The Effect of Feature Shuffle on CFH. Both graph- and node-level CFH scores, $\mathbf{\Tilde{h}}^{(G)}$ and $\mathbf{\Tilde{h}}^{(v)}_i$, tend to approach zero over the feature shuffles.
  • Figure 4: Visual Intuition of Theorem 4.3. When CFH is low ($\mathbf{\Tilde{h}}^{(G)} \approx 0$), the feature distribution of each class shrinks faster (denoted by the arrows) under graph convolution, resulting in a lower Bayes error rate. Namely, the pull of node features towards the feature mean of each class becomes stronger as $\vert \mathbf{\Tilde{h}}^{(G)} \vert$ decreases.
  • Figure 5: The Simplified GNN Performance in CSBM-X Graphs. Consistent with Theorem 4.3, for a given feature distance $\text{FD} > 0$ and class-homophily $\mathbf{h}_c > 0$, the simplified GNN performance increases as graph-level CFH $\mathbf{\Tilde{h}}^{(G)} \rightarrow 0$ (i.e., $\tau \rightarrow 0$).
  • ...and 12 more figures

Theorems & Definitions (16)

  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Lemma 4.1: $\tau$ controls CFH $\mathbf{h}(\cdot)$ precisely
  • Lemma 4.2: $\tau$ controls CFH $\mathbf{h}(\cdot)$ only
  • Theorem 4.3
  • proof : Proof sketch
  • proof
  • proof
  • proof
  • ...and 6 more