A data-centric approach for assessing progress of Graph Neural Networks

Tianqi Zhao, Ngan Thi Dong, Alan Hanjalic, Megha Khosla

TL;DR

This work addresses the realism gap in evaluating multi-label node classification by adopting a data-centric lens that scrutinizes dataset quality and characteristics. It introduces a two-part generator framework and new metrics (including homophily and cross-class neighborhood similarity) to construct and analyze diverse multi-label graphs, complemented by three real-world biological datasets and a large-scale cross-method evaluation. The findings reveal that simple baselines can outperform many GNNs on several datasets and that existing methods often fail to generalize across varying label distributions and homophily, highlighting the need for more robust benchmarks and evaluation protocols. By providing datasets, a tunable generator, and rigorous analysis, the work aims to guide realistic progress in multi-label graph learning with practical impact on domains like protein function prediction.

Abstract

Graph Neural Networks (GNNs) have achieved state-of-the-art results in node classification tasks. However, most improvements are in multi-class classification, with less focus on the cases where each node could have multiple labels. The first challenge in studying multi-label node classification is the scarcity of publicly available datasets. To address this, we collected and released three real-world biological datasets and developed a multi-label graph generator with tunable properties. We also argue that traditional notions of homophily and heterophily do not apply well to multi-label scenarios. Therefore, we define homophily and Cross-Class Neighborhood Similarity for multi-label classification and investigate 9 collected multi-label datasets. Lastly, we conducted a large-scale comparative study with 8 methods across nine datasets to evaluate current progress in multi-label node classification. We release our code at https://github.com/Tianqi-py/MLGNC.
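
The abstract does not spell out how homophily is defined for multi-label graphs. As a rough illustration only, the sketch below assumes one plausible formulation: the average Jaccard similarity between the label sets of the two endpoints of each edge. The function names `jaccard` and `label_homophily`, the toy graph, and the convention that two empty label sets have similarity 0 are all assumptions made for this example. They are not taken from the released code.

```python
from typing import Dict, Iterable, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two label sets (assumed 0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def label_homophily(
    edges: Iterable[Tuple[int, int]],
    labels: Dict[int, Set[int]],
) -> float:
    """Mean Jaccard similarity of endpoint label sets over all edges.

    This is an assumed, illustrative definition of multi-label homophily.
    Returns 0.0 for a graph without edges.
    """
    edge_list = list(edges)
    if not edge_list:
        return 0.0
    total = sum(jaccard(labels[u], labels[v]) for u, v in edge_list)
    return total / len(edge_list)


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 with multi-label nodes.
toy_labels = {0: {0, 1}, 1: {1}, 2: {1, 2}, 3: {3}}
toy_edges = [(0, 1), (1, 2), (2, 3)]

h = label_homophily(toy_edges, toy_labels)
print(f"label homophily: {h:.4f}")
```

Under this assumed definition, a graph in which every edge connects nodes with identical label sets scores 1, and a graph in which connected nodes never share a label scores 0. In the single-label case the measure reduces to the usual fraction of edges whose endpoints have the same label.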

Paper Structure (9 sections, 1 figure, 2 tables)


Figures (1)

  • Figure 1: Cross-Class Neighborhood Similarity in the DBLP and BlogCat datasets.

Theorems & Definitions (2)

  • Definition 1
  • Definition 2