Almost exact recovery in noisy semi-supervised learning

Konstantin Avrachenkov, Maximilien Dreveton

TL;DR

This work tackles semi-supervised clustering on graphs under label noise by modeling the graph with a two-class DC-SBM and an informative noisy oracle. It derives a MAP estimator that blends a graph-based cut/regularization term with a loss term enforcing agreement with observed labels, and then relaxes this NP-hard problem into a continuous spectral-type method using a regularized adjacency matrix $A_\tau$. The authors prove that, in the diverging-mean-degree regime with an informative oracle, the relaxed method achieves almost exact recovery, providing high-probability bounds on the misclassification rate and extending the analysis to mean-field and concentration arguments. Empirical results on synthetic and real data (including MNIST and standard networks) demonstrate robustness to oracle noise and competitiveness with existing SSL approaches, illustrating practical impact for graph-based clustering with imperfect side information.

Abstract

Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
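The setting described in the abstract can be simulated with a short sketch. Everything below is an illustrative assumption rather than the paper's code: the helper names `sample_dcsbm` and `noisy_oracle`, the parameters `reveal_prob` and `noise`, the encoding of labels as $\pm 1$ with $0$ for "unlabeled", and the edge probability $\theta_i \theta_j p_{\rm in}$ (same class) or $\theta_i \theta_j p_{\rm out}$ (different classes), clipped to $[0, 1]$.

```python
import numpy as np

def sample_dcsbm(n, p_in, p_out, theta, rng):
    """Sample a two-class DC-SBM (assumed parametrization, see lead-in)."""
    z = rng.choice([-1, 1], size=n)               # hidden class labels
    same = z[:, None] == z[None, :]               # True where z_i == z_j
    probs = np.where(same, p_in, p_out) * np.outer(theta, theta)
    probs = np.clip(probs, 0.0, 1.0)              # keep valid probabilities
    # Draw each pair i < j once, then symmetrize; no self-loops.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    return A, z

def noisy_oracle(z, reveal_prob, noise, rng):
    """Reveal each label with probability reveal_prob; flip a revealed
    label with probability noise. Unrevealed nodes get 0 (assumed encoding)."""
    n = z.shape[0]
    revealed = rng.random(n) < reveal_prob
    flipped = rng.random(n) < noise
    s = np.where(flipped, -z, z)
    return np.where(revealed, s, 0)

rng = np.random.default_rng(0)
theta = np.ones(200)                              # theta = 1 gives a plain SBM
A, z = sample_dcsbm(200, 0.10, 0.02, theta, rng)
s = noisy_oracle(z, reveal_prob=0.1, noise=0.2, rng=rng)
```

Setting `theta` to all ones recovers the standard SBM; drawing it from a heavier-tailed distribution gives heterogeneous degrees.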

Paper Structure

This paper contains 24 sections, 15 theorems, 83 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 2.2

Let $G$ be a graph drawn from DC-SBM as defined in def:DCSBM and $s$ be the oracle information as defined in eq:def_oracle. Denote $M = (F_1-F_0) \odot A + F_0$, where $F_0 = \left(f^{(0)}_{ij}\right)$ and $F_1 = \left(f^{(1)}_{ij}\right)$ are such that $f^{(a)}_{ij} = \log \frac{\mathbb{P}(A_{i \dots$ … For a perfect oracle $(\eta_0 = 0)$ this reduces to …
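The matrix $M = (F_1 - F_0) \odot A + F_0$ from the theorem statement has a simple entrywise reading when $A$ is a 0/1 adjacency matrix: $M_{ij} = f^{(1)}_{ij}$ where there is an edge and $M_{ij} = f^{(0)}_{ij}$ where there is none. A minimal sketch of that identity follows; `build_M` is a hypothetical helper, and `F0`, `F1` are arbitrary placeholder matrices, not the log-ratios defined in the paper.

```python
import numpy as np

def build_M(A, F0, F1):
    """Entrywise (Hadamard) combination M = (F1 - F0) * A + F0."""
    return (F1 - F0) * A + F0

rng = np.random.default_rng(0)
n = 5
# Random symmetric 0/1 adjacency matrix without self-loops.
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = upper + upper.T
# Placeholder weights (assumption: any real matrices work for this identity).
F0 = rng.normal(size=(n, n))
F1 = rng.normal(size=(n, n))

M = build_M(A, F0, F1)
# M picks F1 on edges and F0 on non-edges.
picks_match = np.allclose(M, np.where(A == 1, F1, F0))
```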

Figures (5)

  • Figure 1: Cost in Algorithm 1 with the standard and normalized versions of the constraint, on $50$ realizations of SBM with $n = 500$, $p_{\rm out} = 0.03$ and $50$ labeled nodes with $10\%$ noise.
  • Figure 2: Average accuracy obtained by different semi-supervised clustering methods on DC-SBM graphs with $n = 2000$, $p_{\rm in} = 0.04$, and $p_{\rm out} = 0.02$, for different distributions of $\theta$. The number of labeled nodes is $40$. Accuracies are computed on the unlabeled nodes and averaged over $100$ realizations; the error bars show the standard error.
  • Figure 3: Average accuracy obtained on a subset of the MNIST data set by different semi-supervised algorithms as a function of the oracle-misclassification ratio, when the number of labeled nodes is equal to $10$. Accuracy is averaged over $100$ random realizations, and the error bars show the standard error.
  • Figure 4: Average accuracy obtained on the unlabeled nodes, on the nodes correctly labeled by the oracle, and on the nodes wrongly labeled by the oracle. Simulations are done on the 1000 digits (2,4). The noisy oracle correctly classifies 24 nodes and misclassifies 16 nodes, and the boxplots show $100$ realizations.
  • Figure 5: Average accuracy obtained on real networks by different semi-supervised algorithms as a function of the oracle-misclassification ratio. The number of labeled nodes is 30 for Political Blogs and LiveJournal, and $100$ for DBLP. Accuracy is averaged over 50 random realizations, and the error bars show the standard error.
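The captions above report accuracy on the unlabeled nodes, averaged over repeated realizations, with the standard error as error bars. A minimal sketch of that evaluation, under stated assumptions: labels are encoded as $\pm 1$, the oracle vector `s` uses $0$ for unlabeled nodes, and the standard error uses the sample standard deviation (`ddof=1`). The function names are hypothetical and do not come from the paper.

```python
import numpy as np

def accuracy_on_unlabeled(z_true, z_pred, s):
    """Fraction of correctly classified nodes among those with s == 0."""
    mask = (s == 0)
    return float(np.mean(z_true[mask] == z_pred[mask]))

def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(number of runs))."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(values.size)

# Toy usage: four nodes, the first one labeled by the oracle.
z_true = np.array([1, 1, -1, -1])
z_pred = np.array([1, -1, -1, -1])
s = np.array([1, 0, 0, 0])
acc = accuracy_on_unlabeled(z_true, z_pred, s)    # 2 of 3 unlabeled nodes correct
mean_acc, stderr = mean_and_standard_error([0.5, 0.7])
```

Restricting the mask to `s != 0`, split by whether `s` agrees with `z_true`, would give the per-group accuracies on correctly and wrongly labeled nodes.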

Theorems & Definitions (30)

  • Theorem 2.2
  • Corollary 2.3
  • Theorem 3.1
  • proof: Proof of Theorem 3.1
  • Corollary 3.2: Almost exact recovery in the diverging degree regime
  • proof
  • Corollary 3.3: Detection in the constant degree regime
  • proof
  • proof: Proof of Theorem 2.2
  • proof: Proof of Corollary 2.3
  • ...and 20 more