How Graph Structure and Label Dependencies Contribute to Node Classification in a Large Network of Documents

Pirmin Lemberger; Antoine Saillenfest

TL;DR

The paper investigates how article content, graph structure, and label dependencies affect semi-supervised node classification on WikiVitals, a large document graph with $|V|=48{,}512$, $|E|=2{,}297{,}532$, and $K=32$ categories. Using Graph Markov Neural Networks (GMNN), the authors couple a mean-field GNN that predicts labels from content with a second GNN that captures label dependencies via an EM-based training regime; they also include a graph-agnostic baseline and a structure-only baseline for comparison. They adapt a rigorous fair evaluation framework to WikiVitals and classical datasets (Cora, Citeseer, Pubmed) to separate model selection from assessment across dense and sparse train splits. Results show that incorporating label dependencies yields statistically significant gains, especially under sparse training, while graph structure improves accuracy across datasets; disassortativity in WikiVitals makes FAGCN particularly effective. The work provides a robust evaluation platform for GMNNs on large real-world graphs and points to future directions in hierarchical and multi-label extensions.

Abstract

We introduce a new dataset named WikiVitals, which contains a large graph of 48k mutually referring Wikipedia articles classified into 32 categories and connected by 2.3M edges. Our aim is to rigorously evaluate the contributions of three distinct sources of information to label prediction in a semi-supervised node classification setting: the content of the articles, their connections with each other, and the correlations among their labels. We perform this evaluation using a Graph Markov Neural Network, which provides a theoretically principled model for this task, and we conduct a detailed evaluation of the contribution of each source of information using a clear separation of model selection and model assessment. One interesting observation is that including the effect of label dependencies is more relevant for sparse train sets than for dense train sets.

Paper Structure (16 sections, 6 equations, 1 figure, 4 tables)

Introduction
Related Work
Evaluating Performance of GNNs
Modelling Label Dependency in GNNs
Classifying Wikipedia Articles
Adapting the Fair Comparison Method to GMNN
Training GMNNs
Fair Comparison of GMNNs
Experiment
Creating WikiVitals
Datasets and Settings
Results
Contribution of the Graph Structure
Contribution of the Label Dependencies
Conclusion and Perspectives
...and 1 more section

Figures (1)

Figure 1: The fair evaluation procedure for GNNs and its adaptation for GMNN uses $k$ train/validation/test splits $\mathcal{D}^{(i)}_{\mathrm{in-train}}, \mathcal{D}^{(i)}_{\mathrm{valid}}, \mathcal{D}^{(i)}_{\mathrm{test}}$ which are created from $k$ stratified folds $\mathcal{F}_i$.
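The split construction named in the Figure 1 caption can be illustrated with a minimal sketch. It only shows one plausible way to derive $k$ train/validation/test splits from $k$ stratified folds. The function names `stratified_folds` and `splits_from_folds` are hypothetical, and the rule used here to assign folds to roles (fold $i$ as test, fold $i+1$ as validation, the rest as in-train) is an assumption for illustration, not necessarily the authors' procedure.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Partition node indices into k folds with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for node, label in enumerate(labels):
        by_label[label].append(node)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        nodes = by_label[label]
        rng.shuffle(nodes)
        # Deal each class round-robin; the running offset keeps fold sizes even.
        for j, node in enumerate(nodes):
            folds[(offset + j) % k].append(node)
        offset += len(nodes)
    return folds


def splits_from_folds(folds):
    """Yield one (in_train, valid, test) triple of index lists per fold.

    Assumption (not taken from the paper): fold i is the test set,
    fold i+1 (cyclically) is the validation set, and the remaining
    folds together form the in-train set. Requires k >= 3.
    """
    k = len(folds)
    for i in range(k):
        v = (i + 1) % k
        test = folds[i]
        valid = folds[v]
        in_train = [n for j in range(k) if j not in (i, v) for n in folds[j]]
        yield in_train, valid, test


# Toy data: 200 nodes, 4 balanced classes (purely illustrative).
labels = [n % 4 for n in range(200)]
folds = stratified_folds(labels, k=10)
splits = list(splits_from_folds(folds))
```

Each of the `k` triples would then be used for one round of model selection (on the validation part) and model assessment (on the test part), so that hyperparameters are never tuned on the nodes used to report accuracy.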
