Multi-label Node Classification On Graph-Structured Data

Tianqi Zhao, Ngan Thi Dong, Alan Hanjalic, Megha Khosla

TL;DR

This work tackles multi-label node classification on graph-structured data, highlighting the scarcity of public datasets and the distinct semantics of homophily in multi-label contexts. It provides three real-world biological datasets and a synthetic generator with tunable properties, plus a framework for analyzing homophily and Cross-Class Neighborhood Similarity (CCNS), across nine datasets and eight methods. Large-scale experiments reveal that simple baselines can outperform some GNNs on several datasets and that conventional AUROC evaluation can be misleading in sparse, multi-label settings, motivating the use of Average Precision. The authors publicly release a comprehensive benchmark to advance standardized evaluation in multi-label graph learning.

Abstract

Graph Neural Networks (GNNs) have shown state-of-the-art improvements in node classification tasks on graphs. While these improvements have been largely demonstrated in a multi-class classification scenario, a more general and realistic scenario in which each node could have multiple labels has so far received little attention. The first challenge in conducting focused studies on multi-label node classification is the limited number of publicly available multi-label graph datasets. Therefore, as our first contribution, we collect and release three real-world biological datasets and develop a multi-label graph generator to generate datasets with tunable properties. While the success of GNNs is usually attributed to high label similarity (high homophily), we argue that a multi-label scenario does not follow the usual semantics of homophily and heterophily so far defined for a multi-class scenario. As our second contribution, we define homophily and Cross-Class Neighborhood Similarity for the multi-label scenario and provide a thorough analysis of the collected $9$ multi-label datasets. Finally, we perform a large-scale comparative study with $8$ methods and $9$ datasets and analyse the performance of the methods to assess the progress made by the current state of the art in the multi-label node classification scenario. We release our benchmark at https://github.com/Tianqi-py/MLGNC.
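
The TL;DR's remark that AUROC can be misleading when labels are sparse can be illustrated with a toy ranking. The sketch below is not taken from the paper or its benchmark: the data (1000 nodes, 10 of them positive for a single label) and the helper names `auroc` and `average_precision` are hypothetical, and both metrics are written out in plain Python using their standard definitions.

```python
def auroc(labels, scores):
    # Probability that a random positive is scored above a random negative
    # (ties count as half a win).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    # Mean of precision@k over the ranks k at which a positive is retrieved.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Hypothetical sparse label: 20 negatives are ranked first, then the
# 10 positives, then the remaining 970 negatives.
labels = [0] * 20 + [1] * 10 + [0] * 970
scores = [1000 - i for i in range(1000)]  # strictly decreasing, no ties

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUROC = {roc:.3f}, AP = {ap:.3f}")
```

On this toy input AUROC is close to 0.98 while Average Precision is close to 0.21: the large pool of easy negatives inflates AUROC even though every one of the top 20 predictions is wrong, which is the kind of gap that motivates reporting Average Precision.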

Paper Structure (28 sections, 5 equations, 16 figures, 10 tables)

Figures (16)

  • Figure 1: Label distributions. In BlogCat, the majority of the nodes have one label. In OGB-Proteins, around $41$% of total nodes have no labels, and only three nodes have the maximum number of $100$ labels.
  • Figure 2: Cross-Class Neighborhood Similarity in real-world datasets
  • Figure 3: Label distributions in biological datasets. The majority of the nodes in all datasets have one label.
  • Figure 4: Cross-Class Neighborhood Similarity in real-world datasets and proposed biological datasets
  • Figure 5: Cross-Class Neighborhood Similarity in hypersphere datasets with varying label homophily
  • ...and 11 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2