How Graph Structure and Label Dependencies Contribute to Node Classification in a Large Network of Documents

Pirmin Lemberger, Antoine Saillenfest

TL;DR

The paper investigates how article content, graph structure, and label dependencies affect semi-supervised node classification on WikiVitals, a large document graph with $|V|=48{,}512$, $|E|=2{,}297{,}532$, and $K=32$ categories. Using Graph Markov Neural Networks (GMNN), the authors couple a mean-field GNN that predicts labels from content with a second GNN that captures label dependencies via an EM-based training regime; they also include a graph-agnostic baseline and a structure-only baseline for comparison. They adapt a rigorous fair evaluation framework to WikiVitals and classical datasets (Cora, Citeseer, Pubmed) to separate model selection from assessment across dense and sparse train splits. Results show that incorporating label dependencies yields statistically significant gains, especially under sparse training, while graph structure improves accuracy across datasets; disassortativity in WikiVitals makes FAGCN particularly effective. The work provides a robust evaluation platform for GMNNs on large real-world graphs and points to future directions in hierarchical and multi-label extensions.

Abstract

We introduce a new dataset named WikiVitals which contains a large graph of 48k mutually referred Wikipedia articles classified into 32 categories and connected by 2.3M edges. Our aim is to rigorously evaluate the contributions of three distinct sources of information to the label prediction in a semi-supervised node classification setting, namely the content of the articles, their connections with each other, and the correlations among their labels. We perform this evaluation using a Graph Markov Neural Network, which provides a theoretically principled model for this task, and we conduct a detailed evaluation of the contribution of each source of information using a clear separation of model selection and model assessment. One interesting observation is that including the effect of label dependencies is more relevant for sparse train sets than it is for dense train sets.

Paper Structure (16 sections, 6 equations, 1 figure, 4 tables)


Figures (1)

  • Figure 1: The fair evaluation procedure for GNNs and its adaptation for GMNN uses $k$ train/validation/test splits $\mathcal{D}^{(i)}_{\mathrm{in-train}}, \mathcal{D}^{(i)}_{\mathrm{valid}}, \mathcal{D}^{(i)}_{\mathrm{test}}$ which are created from $k$ stratified folds $\mathcal{F}_i$.
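The caption of Figure 1 can be made concrete with a minimal sketch of how $k$ train/validation/test splits might be derived from $k$ stratified folds $\mathcal{F}_i$. The caption does not say which folds play which role, so the assignment used here (fold $i$ as test, fold $i+1$ as validation, the remaining folds as in-train) is an assumption for illustration only, and the helper names `stratified_folds` and `splits_from_folds` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Partition node indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        nodes = by_class[label]
        rng.shuffle(nodes)
        # Deal nodes round-robin, continuing where the previous class
        # stopped, so that fold sizes stay balanced overall.
        for j, node in enumerate(nodes):
            folds[(offset + j) % k].append(node)
        offset += len(nodes)
    return folds


def splits_from_folds(folds):
    """Build one (in_train, valid, test) split per fold.

    ASSUMPTION: fold i is the test set, fold (i + 1) mod k the validation
    set, and all remaining folds form the in-train set. The source does
    not specify this assignment. Requires k >= 3.
    """
    k = len(folds)
    if k < 3:
        raise ValueError("need at least 3 folds")
    splits = []
    for i in range(k):
        held_out = (i, (i + 1) % k)
        test = list(folds[held_out[0]])
        valid = list(folds[held_out[1]])
        in_train = [n for j, f in enumerate(folds)
                    if j not in held_out for n in f]
        splits.append((in_train, valid, test))
    return splits


if __name__ == "__main__":
    toy_labels = [i % 4 for i in range(40)]  # 40 nodes, 4 classes
    toy_splits = splits_from_folds(stratified_folds(toy_labels, k=5))
    for in_train, valid, test in toy_splits:
        print(len(in_train), len(valid), len(test))
```

Under this assumed assignment every node appears in exactly one test set across the $k$ splits, which is what allows test scores to be averaged over folds. A sparse-train variant could swap the roles so that a single fold is used for training, but that too would be a choice not confirmed by the caption.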