
Optimal Inference in Contextual Stochastic Block Models

O. Duranthon, L. Zdeborová

TL;DR

The authors derive a belief-propagation-based algorithm for the semi-supervised cSBM, conjectured to be optimal, and show that there can be a considerable gap between the accuracy it reaches and the performance of the GNN architectures proposed in the literature. This suggests that the cSBM, together with the comparison to the performance of the optimal algorithm, can be instrumental in the development of more performant GNN architectures.

Abstract

The contextual stochastic block model (cSBM) was proposed for unsupervised community detection on attributed graphs where both the graph and the high-dimensional node information correlate with node labels. In the context of machine learning on graphs, the cSBM has been widely used as a synthetic dataset for evaluating the performance of graph-neural networks (GNNs) for semi-supervised node classification. We consider a probabilistic Bayes-optimal formulation of the inference problem and we derive a belief-propagation-based algorithm for the semi-supervised cSBM; we conjecture it is optimal in the considered setting and we provide its implementation. We show that there can be a considerable gap between the accuracy reached by this algorithm and the performance of the GNN architectures proposed in the literature. This suggests that the cSBM, along with the comparison to the performance of the optimal algorithm, readily accessible via our implementation, can be instrumental in the development of more performant GNN architectures.

Paper Structure (34 sections, 61 equations, 9 figures)

Figures (9)

  • Figure 1: Convergence to the high-dimensional limit. Overlap $q_U$ of the fixed point of AMP--BP vs the snr $\lambda$ for several system sizes $N$. Left: unsupervised case, $\rho=0$. Right: semi-supervised, $\rho=0.1$. The other parameters are $\alpha=10$, $\mu^2=4$, $d=5$. We run ten experiments per point.
  • Figure 2: Performance of AMP--BP and of the spectral algorithm of [cSBM18], sec. 4. Overlap $q_U$ of the fixed point of the algorithms vs the snr $\lambda$ for a range of ratios $\alpha$. Left: unsupervised, $\rho=0$; right: semi-supervised, $\rho=0.1$. Vertical dashed lines on the left: theoretical thresholds $\lambda_c$ to partial recovery, eq. \ref{eq:lambda_c}. $N=3\times 10^4$, $\mu^2=4$, $d=5$. We run ten experiments per point.
  • Figure 3: Comparison against GPR-GNN [pageRankGNN20]. Overlap $q_U$ achieved by the algorithms vs $\varphi=\frac{2}{\pi}\arctan\left(\frac{\lambda\sqrt{\alpha}}{\mu}\right)$. Left: few nodes revealed, $\rho=0.025$; right: more nodes revealed, $\rho=0.6$. For GPR-GNN we plot the results of Fig. 2 and Tables 5 and 6 from [pageRankGNN20]. $N=5\times 10^3$, $\alpha=2.5$, $\epsilon=3.25$, $d=5$. We run ten experiments per point for AMP--BP.
  • Figure 4: Comparison to GNNs of various architectures and convergence to a high-dimensional limit. Overlap $q_U$ achieved by the GNNs vs the snr $\lambda$. Left: general convolution for different numbers of layers $K$; middle: different types of convolutions, each at its best $K$ (the detailed results for every $K$ are reported in Fig. \ref{fig:comparisonK_bis} of appendix \ref{sec:appendixFigures}); right: general convolution at $K=3$ for different sizes $N$. The other parameters are $N=3\times 10^4$, $\alpha=10$, $\mu^2=4$, $d=5$, $\rho=0.1$. We run five experiments per point.
  • Figure 5: Comparison against clipGNN [baranwal23clipGNN]. Overlap $q_U$ achieved by the algorithms vs $\lambda$. $l$ is the size of the neighborhood that clipGNN processes. Left: $\mu^2=50$ and $\alpha=50$, i.e. $P=200$; right: $\mu^2=500$ and $\alpha=500$, i.e. $P=20$. The other parameters are $N=10^4$ ($N=5\times 10^3$ for the two largest $l$), $\rho=0.05$, $d=5$, $L=1$. For clipGNN we run the code kindly provided by the authors; we run five experiments per point.
  • ...and 4 more figures
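
The horizontal axis of Figure 3 uses the reparametrization $\varphi=\frac{2}{\pi}\arctan(\lambda\sqrt\alpha/\mu)$ rather than $\lambda$ itself. As a minimal sketch of that mapping only, the function below evaluates the formula from the caption. The name `phi_from_snr` and the restriction to $\mu \ge 0$ are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def phi_from_snr(lam: float, alpha: float, mu: float) -> float:
    """Evaluate phi = (2/pi) * arctan(lam * sqrt(alpha) / mu) from the Figure 3 caption.

    Hypothetical helper: assumes alpha > 0 and mu >= 0.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if mu < 0:
        raise ValueError("this sketch assumes mu >= 0")
    # For mu > 0, atan2(y, mu) equals arctan(y / mu); for mu == 0 it
    # returns +/- pi/2 (or 0 when y == 0) instead of dividing by zero.
    return (2.0 / math.pi) * math.atan2(lam * math.sqrt(alpha), mu)
```

Under this formula, $\varphi=0$ corresponds to $\lambda=0$ and $|\varphi|=1$ corresponds to $\mu=0$, so $\varphi$ is confined to $[-1, 1]$.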