Optimal Exact Recovery in Semi-Supervised Learning: A Study of Spectral Methods and Graph Convolutional Networks

Hai-Xiao Wang, Zhichao Wang

TL;DR

This work addresses semi-supervised node classification on the CSBM by establishing the information-theoretic (IT) threshold for exact recovery under transductive learning. It introduces a PCA-inspired optimal spectral estimator that combines the adjacency matrix with Gaussian mixture model (GMM) features, and it analyzes graph ridge regression and GCNs, showing that both can reach the IT limit with optimally weighted self-loops. The IT boundary is expressed through $I(a_{\tau}, b_{\tau}, c_{\tau})$, abbreviated $I_{\tau}$: exact recovery undergoes a phase transition at $I_{\tau}=1$, and the optimal mismatch ratio decays as $e^{-I_{\tau} q_m}$ when $I_{\tau}\le 1$. The results underscore the importance of feature learning in GCNs and provide practical guidance for tuning self-loops to meet the information-theoretic limit, with potential extensions to more complex graphs and non-linear GCNs.

Abstract

We delve into the challenge of semi-supervised node classification on the Contextual Stochastic Block Model (CSBM) dataset. Here, nodes from the two-cluster Stochastic Block Model (SBM) are coupled with feature vectors, which are derived from a Gaussian Mixture Model (GMM) that corresponds to their respective node labels. With only a subset of the CSBM node labels accessible for training, our primary objective becomes the accurate classification of the remaining nodes. Venturing into the transductive learning landscape, we, for the first time, pinpoint the information-theoretical threshold for the exact recovery of all test nodes in CSBM. Concurrently, we design an optimal spectral estimator inspired by Principal Component Analysis (PCA) with the training labels and essential data from both the adjacency matrix and feature vectors. We also evaluate the efficacy of graph ridge regression and Graph Convolutional Networks (GCN) on this synthetic dataset. Our findings underscore that graph ridge regression and GCN possess the ability to achieve the information threshold of exact recovery in a manner akin to the optimal estimator when using the optimal weighted self-loops. This highlights the potential role of feature learning in augmenting the proficiency of GCN, especially in the realm of semi-supervised learning.
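The data model described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the edge probabilities `p_in` and `q_out`, the mean vector `mu`, and the unit-variance noise are hypothetical and do not reproduce the paper's scaling of the parameters $a_{\tau}$, $b_{\tau}$, $c_{\tau}$. The mismatch ratio is computed as a plain fraction of misclassified test nodes, which may differ from the paper's exact definition of $\psi_m$.

```python
import numpy as np

def sample_csbm(n_nodes, p_in, q_out, mu, train_frac, seed=0):
    """Draw one two-cluster CSBM sample (illustrative parameterization only)."""
    rng = np.random.default_rng(seed)
    # Labels are +1 or -1, drawn uniformly at random.
    y = rng.choice([-1, 1], size=n_nodes)
    # SBM part: connect same-label pairs w.p. p_in, different-label pairs w.p. q_out.
    same = y[:, None] == y[None, :]
    probs = np.where(same, p_in, q_out)
    upper = np.triu(rng.random((n_nodes, n_nodes)) < probs, k=1)
    A = (upper | upper.T).astype(int)  # symmetric adjacency, no self-loops
    # GMM part: feature of node i is y_i * mu plus standard Gaussian noise.
    X = y[:, None] * mu[None, :] + rng.standard_normal((n_nodes, mu.size))
    # Semi-supervised split: reveal the labels of a random subset of nodes.
    train_mask = np.zeros(n_nodes, dtype=bool)
    n_train = int(train_frac * n_nodes)
    train_mask[rng.choice(n_nodes, size=n_train, replace=False)] = True
    return A, X, y, train_mask

def mismatch_ratio(y_hat, y, test_mask):
    """Fraction of test nodes whose predicted label differs from the truth."""
    return float(np.mean(y_hat[test_mask] != y[test_mask]))

A, X, y, train_mask = sample_csbm(200, 0.3, 0.05, np.array([1.0, -1.0, 0.5]), 0.25)
```

In this sketch, exact recovery corresponds to `mismatch_ratio(y_hat, y, ~train_mask) == 0.0`, that is, every unlabeled node is classified correctly.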

Paper Structure

This paper contains 38 sections, 35 theorems, 194 equations, 7 figures, 3 algorithms.

Key Result

Theorem 3.2

Under Assumption \ref{ass:asymptotics} with $q_m = \log(m)$, as $m \to \infty$, any algorithm will misclassify at least $2$ vertices with probability tending to $1$ if $I(a_{\tau}, b_{\tau}, c_{\tau}) < 1$.
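
As a short worked step, assume that $m$ counts the nodes to be classified and take the rate $e^{-I q_m}$ quoted in the TL;DR at face value, writing $I$ for $I(a_{\tau}, b_{\tau}, c_{\tau})$. The choice $q_m = \log(m)$ then turns the rate into a power of $m$, which matches the red curve $m^{-I}$ in Figure 5. The count of expected errors below is only a heuristic reading of why the threshold sits at $I = 1$, not the paper's proof:

$$
e^{-I q_m}\Big|_{q_m = \log m} = m^{-I},
\qquad
m \cdot m^{-I} = m^{1 - I} \longrightarrow
\begin{cases}
\infty, & I < 1,\\
0, & I > 1,
\end{cases}
\quad \text{as } m \to \infty.
$$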

Figures (7)

  • Figure 1: An example of SBM under semi-supervised learning. Red: ${\mathcal{V}}_{{\mathbb{L}},+}$; blue: ${\mathcal{V}}_{{\mathbb{L}},-}$; yellow: ${\mathcal{V}}_{{\mathbb{U}},+}$; orange: ${\mathcal{V}}_{{\mathbb{U}},-}$.
  • Figure 2: Performance of $\widehat{{\boldsymbol y}}_{\mathrm{PCA}}$ in \ref{eqn:pcaEstimator}: fix $N = 800$, $\tau = 0.25$ and vary $a$ ($y$-axis) and $b$ ($x$-axis) from $1$ to $10.5$. For each parameter configuration $(a_{\tau}, b_{\tau}, c_{\tau})$, we compute the frequency of exact recovery over $20$ independent runs. Light color represents a high chance of success. The phase transition occurs at the red curve $I(a_{\tau}, b_{\tau}, c_{\tau}) = 1$, as proved by Theorems \ref{thm:impossibility_CSBM} and \ref{thm:achievability_CSBM}.
  • Figure 3: The $y$-axis is $q_m^{-1}\log(\mathbb{E} \psi_m)$, the average mismatch ratio on the logarithmic scale. The $x$-axis is $a$, varying from $0$ to $10.5$. Fix $b = 5$, $\tau = 0.25$, $c_{\tau} = 0.5$. The red curve is $-I(a_{\tau}, b_{\tau}, c_{\tau})$, the lower bound predicted by Theorem \ref{thm:ITlowerbounds_CSBM}. The experiments over different $N$ show that $\widehat{{\boldsymbol y}}_{\mathrm{PCA}}$ achieves the information-theoretic limit, as proved in Theorems \ref{thm:achievability_CSBM} and \ref{thm:general_pcaEstimator}.
  • Figure 4: Performance of $\widehat{{\boldsymbol y}}_{\mathrm{LRR}}$ in \ref{eq:regression_solu_y}. Fix $N = 800$, $\tau = 0.25$, $c_{\tau} = 0.5$. We compute the frequency of exact recovery over $20$ independent runs. When $I(a_{\tau}, b_{\tau}, c_{\tau}) > 1$, $\widehat{{\boldsymbol y}}_{\mathrm{LRR}}$ achieves exact recovery, as proved in Theorem \ref{thm:exact_linear} (a) and (b).
  • Figure 5: The $y$-axis is $\mathbb{E} \psi_m$, the average mismatch ratio over $20$ independent runs. The $x$-axis is $a$, varying from $0$ to $10.5$. Fix $b = 4$, $\tau = 0.25$, $c_{\tau} = 0.5$, $N = 400$. The red curve is $m^{-I(a_{\tau}, b_{\tau}, c_{\tau})}$, the lower bound predicted by Theorem \ref{thm:ITlowerbounds_CSBM} with $q_m = \log(m)$. This experiment shows that $\widehat{{\boldsymbol y}}_{\mathrm{LRR}}$ achieves a lower mismatch ratio when self-loops are added in the regime $I(a_{\tau}, b_{\tau}, c_{\tau}) < 1$, where exact recovery is impossible.
  • ...and 2 more figures

Theorems & Definitions (71)

  • Definition 2.1: Exact recovery
  • Definition 2.2: Binary Stochastic Block Model, SBM
  • Definition 2.3: Gaussian Mixture Model, GMM
  • Definition 2.4: Contextual Stochastic Block Model, CSBM
  • Definition 2.5: Semisupervised CSBM
  • Remark 2.6
  • Definition 2.7
  • Theorem 3.2: Impossibility
  • Theorem 3.3
  • Theorem 3.4
  • ...and 61 more