A Probabilistic Model for Node Classification in Directed Graphs

Diego Huerta; Gerardo Arizmendi

TL;DR

This work develops a probabilistic classifier for directed graphs with node attributes, enabling inductive label prediction for unseen nodes via maximum likelihood (ML) or maximum a posteriori (MAP) estimation. The model explicitly specifies a generative process with parameters $\pi$, $\Theta$, $\Xi$ and conditional distributions $\psi_i$, $\phi_i$, and $\omega_i$, whose terms are interpretable and derived from each node's first-order neighborhood and attributes. It performs competitively against neural baselines on two datasets, the Math Genealogy Project (MGP) and ogbn-arxiv, while keeping each component of a decision interpretable. The authors also introduce a new MGP-derived dataset and provide baselines and hyperparameter strategies, highlighting practical applicability and efficiency for large graphs with textual attributes.

Abstract

In this work, we present a probabilistic model for directed graphs where nodes have attributes and labels. This model serves as a generative classifier capable of predicting the labels of unseen nodes using either maximum likelihood or maximum a posteriori estimations. The predictions made by this model are highly interpretable, contrasting with some common methods for node classification, such as graph neural networks. We applied the model to two datasets, demonstrating predictive performance that is competitive with, and even superior to, state-of-the-art methods. One of the datasets considered is adapted from the Math Genealogy Project, which has not previously been utilized for this purpose. Consequently, we evaluated several classification algorithms on this dataset to compare the performance of our model and provide benchmarks for this new resource.
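The difference between the two prediction rules mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `predict_label`, the per-label log-likelihoods, and the class prior are hypothetical values chosen only to show that ML picks the label with the highest likelihood, while MAP also weighs a prior over labels.

```python
import math


def predict_label(log_likelihood, log_prior=None):
    """Return the label with the highest score.

    log_likelihood: dict mapping label -> log p(observations | label).
    log_prior: optional dict mapping label -> log p(label).
    Without a prior this is the ML rule; with a prior it is the MAP rule.
    """
    def score(label):
        total = log_likelihood[label]
        if log_prior is not None:
            total += log_prior[label]
        return total

    return max(log_likelihood, key=score)


# Hypothetical numbers for three labels (assumptions, not from the paper).
log_lik = {1: math.log(0.20), 2: math.log(0.30), 3: math.log(0.10)}
log_prior = {1: math.log(0.70), 2: math.log(0.20), 3: math.log(0.10)}

ml_label = predict_label(log_lik)              # likelihood only
map_label = predict_label(log_lik, log_prior)  # likelihood times prior
```

With these made-up numbers the ML rule selects label 2 (largest likelihood), whereas the MAP rule selects label 1, because its prior outweighs the smaller likelihood.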

Paper Structure (36 sections, 43 equations, 5 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Probability theory
Multinomial distribution
Discrete truncated power law
Log-normal distribution
Machine Learning Models for Classification
Naive Bayes
BERT
Graph Convolutional Networks
Model
Parameter estimation
Node Classification
Prediction over a single node
Maximum Likelihood Estimate
...and 21 more sections

Figures (5)

Figure 1: Example of notation. Let the label of a node denote its color, and consider $\mathcal{Y} = \{ 1, 2, 3\}$ where label 1 indicates red, label 2 indicates green, and label 3 indicates purple. Therefore, $y_3 = 2$, $N^\text{in}(3) = \{1, 2\}$, $N^\text{out}(3) = \{4, 5, 6\}$, $d_3^{\text{in}} = 2$, $d_3^{\text{out}} = 3$, $p_3 = (1, 1, 0)$ and $s_3 = (2, 0, 1)$.
Figure 2: Induced subgraph of size 500 of the directed graph built from the Math Genealogy Project data. The size of a node is given by its out-degree. Colors distinguish the Mathematics Subject Classification (MSC) of the nodes; nodes with a missing MSC are shown in black.
Figure 3: Chi-squared goodness-of-fit test for the out-degree of nodes with label 46 (68: Computer science). For this sample, the test statistic takes the value $T = 12.109$, yielding a p-value of 0.35. Therefore, the test does not reject the hypothesis that the data follow the proposed distribution.
Figure 4: Chi-squared goodness-of-fit test for the out-degree of nodes with label 58 (91: Game theory, economics, social and behavioral sciences). For this sample, the test statistic takes the value $T = 25.5$, yielding a p-value of 0.007. Therefore, the test rejects the hypothesis that the data follow the proposed distribution. However, this distribution appears to be a good approximation.
Figure 5: Chi-squared goodness-of-fit test for the in-degree of nodes with label 2. For this sample, the test statistic takes the value $T = 14.36$, yielding a p-value of 0.21. Therefore, the test does not reject the hypothesis that the data follow the proposed distribution.
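
The notation in the Figure 1 caption can be reproduced with a short sketch. It reads $p_i$ and $s_i$ as per-label counts over the in-neighbors and out-neighbors of node $i$, which is an interpretation inferred from the caption's numbers rather than a definition quoted from the paper. The function name `neighbor_label_counts`, the edge-list representation, and the labels assigned to the individual neighbors are assumptions; the caption fixes only $y_3$ and the resulting counts.

```python
def neighbor_label_counts(node, edges, labels, num_labels):
    """Return (in_neighbors, out_neighbors, p, s) for `node`.

    edges: list of (u, v) pairs, each read as a directed edge u -> v.
    labels: dict mapping node -> label in {1, ..., num_labels}.
    p[k-1] counts in-neighbors with label k; s[k-1] counts out-neighbors
    with label k.
    """
    in_nb = sorted(u for u, v in edges if v == node)
    out_nb = sorted(v for u, v in edges if u == node)
    p = [0] * num_labels
    s = [0] * num_labels
    for u in in_nb:
        p[labels[u] - 1] += 1
    for v in out_nb:
        s[labels[v] - 1] += 1
    return in_nb, out_nb, tuple(p), tuple(s)


# Graph around node 3 as described in the caption of Figure 1.
edges = [(1, 3), (2, 3), (3, 4), (3, 5), (3, 6)]
# Only y_3 = 2 is stated in the caption; the other labels are one
# hypothetical assignment consistent with p_3 and s_3.
labels = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 6: 3}

in_nb, out_nb, p3, s3 = neighbor_label_counts(3, edges, labels, num_labels=3)
in_degree, out_degree = len(in_nb), len(out_nb)
```

Under this assignment the sketch yields the in-neighbors {1, 2}, the out-neighbors {4, 5, 6}, degrees 2 and 3, and the count vectors (1, 1, 0) and (2, 0, 1), matching the caption.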
