Re-evaluating the Advancements of Heterophilic Graph Learning

Sitao Luan, Qincheng Lu, Chenqing Hua, Xinyu Wang, Jiaqi Zhu, Xiao-Wen Chang

TL;DR

This work re-evaluates heterophilic graph learning through a comprehensive, hyperparameter-tuned study on 27 benchmark datasets, identifying malignant and ambiguous heterophily as the truly challenging cases and benign heterophily as a pseudo-challenge. It re-assesses 11 SOTA GNNs across these dataset categories and shows that many methods do not consistently outperform strong baselines, with scalability issues evident for several heterophily-focused models. In addition, the paper provides the first quantitative evaluation of 11 homophily metrics on synthetic graphs generated by RG, PA, and GenCat, using Pearson and Fréchet distances to assess metric reliability and finding that older metrics often outperform newer proposals. The findings advocate principled hyperparameter tuning and diverse, challenging evaluation settings for fairly assessing heterophily remedies and metric usefulness. Overall, the work offers a nuanced view of heterophily, emphasizes realistic benchmarks, and lays the groundwork for more reliable evaluation frameworks in graph representation learning.

Abstract

Over the past decade, Graph Neural Networks (GNNs) have achieved great success on machine learning tasks with relational data. However, recent studies have found that heterophily can cause significant performance degradation of GNNs, especially on node-level tasks. Numerous heterophilic benchmark datasets have been put forward to validate the efficacy of heterophily-specific GNNs, and various homophily metrics have been designed to help recognize these challenging datasets. Nevertheless, multiple pitfalls still severely hinder the proper evaluation of new models and metrics: 1) lack of hyperparameter tuning; 2) insufficient evaluation on the truly challenging heterophilic datasets; 3) missing quantitative evaluation of homophily metrics on synthetic graphs. To overcome these challenges, we first train and fine-tune baseline models on the $27$ most widely used benchmark datasets and categorize them into three distinct groups: malignant, benign and ambiguous heterophilic datasets. We identify malignant and ambiguous heterophily as the truly challenging subsets of tasks, and to the best of our knowledge, we are the first to propose such a taxonomy. Then, we re-evaluate $11$ state-of-the-art (SOTA) GNNs, covering six popular methods, with fine-tuned hyperparameters on the different groups of heterophilic datasets. Based on the model performance, we comprehensively reassess the effectiveness of different methods on heterophily. Finally, we evaluate $11$ popular homophily metrics on synthetic graphs with three different graph generation approaches. To overcome the unreliability of observation-based comparison and evaluation, we conduct the first quantitative evaluation and provide detailed analysis.
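As a rough illustration of what a homophily metric measures, the sketch below computes two of the metrics compared in Figure 1, $\text{H}_{\text{edge}}$ and $\text{H}_{\text{node}}$, on a toy graph. The abstract does not state their formulas, so the code assumes the commonly used definitions: edge homophily is the fraction of edges whose endpoints share a label, and node homophily is the average, over nodes, of the fraction of same-label neighbors. The function names, the undirected edge-list representation, and the toy graph are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def edge_homophily(edges, labels):
    """Fraction of undirected edges whose two endpoints share a label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def node_homophily(edges, labels):
    """Mean over nodes (with at least one neighbor) of the fraction of
    neighbors that share the node's label."""
    neighbors = defaultdict(list)
    for u, v in edges:
        # Each undirected edge contributes a neighbor in both directions.
        neighbors[u].append(v)
        neighbors[v].append(u)
    if not neighbors:
        raise ValueError("graph has no edges")
    ratios = [
        sum(1 for n in nbrs if labels[n] == labels[v]) / len(nbrs)
        for v, nbrs in neighbors.items()
    ]
    return sum(ratios) / len(ratios)


# Hypothetical toy graph: 4 nodes, 2 classes, 4 undirected edges.
toy_labels = [0, 0, 1, 1]
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]

h_edge = edge_homophily(toy_edges, toy_labels)  # 1 of 4 edges is intra-class
h_node = node_homophily(toy_edges, toy_labels)  # mean of 1/3, 1/2, 0, 0
```

Values near 1 indicate a homophilic graph and values near 0 a heterophilic one under these definitions. The two metrics can disagree on the same graph, as they do on this toy example, which is one reason a quantitative comparison of metrics is useful.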

Paper Structure

This paper contains 32 sections, 12 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Comparison of metrics on synthetic graphs with different generation methods. Note that $\text{H}_{\text{node}}$ overlaps with $\text{H}_{\text{edge}}$ in panels (e) and (f). In panel (e), $\text{H}_{\text{class}}(\mathcal{G})$ overlaps with $\text{H}_{\text{adj}}(\mathcal{G})$. $\text{KR}_\text{L}$, $\text{KR}_\text{NL}$ and GNB overlap in panel (d).