On the Impact of Feature Heterophily on Link Prediction with Graph Neural Networks

Jiong Zhu, Gaotang Li, Yao-An Yang, Jing Zhu, Xuehao Cui, Danai Koutra

TL;DR

This work focuses on the link prediction task and systematically analyzes the impact of heterophily in node features on GNN performance. It introduces formal definitions of homophilic and heterophilic link prediction tasks, along with a theoretical framework that highlights the different optimizations needed for the respective tasks.

Abstract

Heterophily, or the tendency of connected nodes in networks to have different class labels or dissimilar features, has been identified as challenging for many Graph Neural Network (GNN) models. While the challenges of applying GNNs for node classification when class labels display strong heterophily are well understood, it is unclear how heterophily affects GNN performance in other important graph learning tasks where class labels are not available. In this work, we focus on the link prediction task and systematically analyze the impact of heterophily in node features on GNN performance. Theoretically, we first introduce formal definitions of homophilic and heterophilic link prediction tasks, and present a theoretical framework that highlights the different optimizations needed for the respective tasks. We then analyze how different link prediction encoders and decoders adapt to varying levels of feature homophily and introduce designs for improved performance. Our empirical analysis on a variety of synthetic and real-world datasets confirms our theoretical insights and highlights the importance of adopting learnable decoders and GNN encoders with ego- and neighbor-embedding separation in message passing for link prediction tasks beyond homophily.

Paper Structure (32 sections, 3 theorems, 12 equations, 10 figures, 4 tables)

This paper contains 32 sections, 3 theorems, 12 equations, 10 figures, 4 tables.

Key Result

Theorem 1

Following the above assumptions, consider two DistMult decoders that are fully optimized for homophilic and heterophilic link prediction problems, respectively. Given an arbitrary node pair $(u', v')$ with node features $\mathbf{x}_{u'} = (\cos \theta_{u'}, \sin \theta_{u'})$ and $\mathbf{x}_{v'} = (\

Figures (10)

  • Figure 1: Categorizing link prediction tasks based on the distribution of feature similarity scores of positive node pairs (i.e., edges, colored in green) and negative node pairs (i.e., non-edges, colored in red): the two distributions, whose densities are visualized in the plots, are (approximately) separated by the threshold(s) $M$. Homophilic and heterophilic link prediction differ in whether the positive similarity scores fall on the larger or the smaller side of the threshold $M$, while the magnitude of $M$ indicates the variance of positive similarity. Gated link prediction is a more complex case in which the distributions of positive and negative similarity scores cannot be separated by a single threshold.
  • Figure 2: Link prediction scores $\hat{y}_{u'v'}$ for decoders optimized under the homophilic and heterophilic setups, respectively, in Thm. \ref{thm:hete-homo-sim-scores}, assuming $M=0.5$.
  • Figure 3: Comparing link prediction methods on synthetic graphs with varying levels of feature similarity: (\ref{fig:synthetic-decoder-fixed-sage}) and (\ref{fig:synthetic-decoder-fixed-gcn}) focus on decoders, while (\ref{fig:synthetic-encoder-fixed-mlp}) focuses on encoders. We include an MLP decoder without a GNN as a graph-agnostic baseline in all plots. Numerical results are reported in Table \ref{tab:synthetic-results}.
  • Figure 4: Pairwise comparison of encoder or decoder choices on test edges grouped by node degree (x-axis) and feature similarity (y-axis): green denotes an increase in MRR and purple denotes a decrease. More plots are in Figs. \ref{fig:per-edge-analysis-count}-\ref{fig:per-edge-analysis-esci}.
  • Figure 5: Comparison of feature similarity distributions for edges and random node pairs on real-world datasets used in our experiments. For similarity scores of random node pairs, we randomly sample 1000 nodes and compute the pairwise cosine similarity between these node features. Similarity score distributions for random node pairs are good approximations of the distributions for non-edge node pairs due to the sparsity of the graphs.
  • ...and 5 more figures
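
The sampling procedure described in the Figure 5 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the `seed` parameter, and the assumption that edges arrive as an integer array of shape `(num_edges, 2)` are all hypothetical choices made for this example.

```python
import numpy as np

def edge_similarities(x, edges):
    """Cosine similarity between the endpoint features of each edge (u, v)."""
    u, v = edges[:, 0], edges[:, 1]
    dots = np.sum(x[u] * x[v], axis=1)
    norms = np.linalg.norm(x[u], axis=1) * np.linalg.norm(x[v], axis=1)
    # Guard against zero-norm feature vectors (assumption: treat them as similarity 0).
    return dots / np.clip(norms, 1e-12, None)

def random_pair_similarities(x, num_samples=1000, seed=0):
    """Sample nodes without replacement and return all pairwise cosine similarities."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    idx = rng.choice(n, size=min(num_samples, n), replace=False)
    norms = np.linalg.norm(x[idx], axis=1, keepdims=True)
    unit = x[idx] / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Keep each unordered pair once and drop self-similarities on the diagonal.
    rows, cols = np.triu_indices(len(idx), k=1)
    return sim[rows, cols]

# Toy usage with random features and random edges (illustrative data only).
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 16))
toy_edges = rng.integers(0, 200, size=(500, 2))
pos_scores = edge_similarities(features, toy_edges)
ref_scores = random_pair_similarities(features, num_samples=100)
```

Comparing histograms of `pos_scores` and `ref_scores` gives the kind of edge versus random-pair comparison the caption describes; on a sparse graph, almost all sampled pairs are non-edges, which is why the caption treats them as a proxy for negatives.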

Theorems & Definitions (11)

  • Definition 1: Node Feature Similarity
  • Definition 2: Graph Feature Similarity
  • Definition 3: Homophilic Link Prediction
  • Definition 4: Heterophilic Link Prediction
  • Definition 5: Gated Link Prediction
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Proof 1
  • Proof 2
  • ...and 1 more
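
To make the setting of Theorem 1 more concrete, the sketch below evaluates a DistMult-style score (an element-wise weighted inner product) on unit-circle features of the form $(\cos \theta, \sin \theta)$. The relation weights `r_homophilic` and `r_heterophilic` are hypothetical values chosen only to show how the sign of the weights flips the preference between similar and dissimilar features; they are not the optimized decoders from the theorem, and the paper's decoder may include further components (such as an output nonlinearity) that are omitted here.

```python
import numpy as np

def unit_circle_features(theta):
    """Map angles to 2-D features (cos(theta), sin(theta))."""
    theta = np.asarray(theta, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def distmult_score(x_u, x_v, r):
    """DistMult-style score: sum_i x_u[i] * r[i] * x_v[i]."""
    return np.sum(x_u * r * x_v, axis=-1)

# Hypothetical relation weights, chosen for illustration only.
r_homophilic = np.array([1.0, 1.0])     # score equals  cos(theta_u - theta_v)
r_heterophilic = np.array([-1.0, -1.0])  # score equals -cos(theta_u - theta_v)

theta_u = np.array([0.0, 0.0])
theta_v = np.array([0.0, np.pi])  # first pair identical, second pair opposite
x_u = unit_circle_features(theta_u)
x_v = unit_circle_features(theta_v)

homophilic_scores = distmult_score(x_u, x_v, r_homophilic)
heterophilic_scores = distmult_score(x_u, x_v, r_heterophilic)
```

With these illustrative weights, the homophilic decoder scores the identical pair highest and the opposite pair lowest, while the heterophilic decoder reverses that ordering. Because the relation weights are learnable, a single decoder of this form can represent either preference.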