Almost exact recovery in noisy semi-supervised learning
Konstantin Avrachenkov, Maximilien Dreveton
TL;DR
This work studies semi-supervised clustering on graphs under label noise, modeling the graph as a two-class DC-SBM and the side information as an informative noisy oracle. It derives a MAP estimator that combines a graph-based cut/regularization term with a loss term rewarding agreement with the observed labels, and then relaxes this NP-hard problem into a continuous spectral-type method built on a regularized adjacency matrix $A_\tau$. The authors prove that, in the diverging-mean-degree regime with an informative oracle, the relaxed method achieves almost exact recovery, and they give high-probability bounds on the misclassification rate; the analysis relies on mean-field and concentration arguments. Experiments on synthetic and real data (including MNIST and standard networks) show that the method is robust to oracle noise and competitive with existing SSL approaches.
Abstract
Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
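To make the idea of a continuous relaxation concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a particular form $A_\tau = A - \tau \mathbf{1}\mathbf{1}^T$ with $\tau$ set to the empirical edge density, a plain (non-degree-corrected) two-block SBM, and a ridge-style shift `gamma` in place of a norm constraint; it minimizes $-x^T A_\tau x + \lambda \|Px - s\|^2 + \gamma \|x\|^2$, where $P$ selects the labeled nodes and $s$ holds the noisy labels. All function names and parameters (`sample_two_block_sbm`, `noisy_oracle`, `relaxed_ssl_labels`, `lam`, `margin`) are hypothetical and chosen for illustration.

```python
import numpy as np


def sample_two_block_sbm(n, p_in, p_out, rng):
    """Sample a symmetric two-class SBM adjacency matrix with +/-1 labels."""
    labels = np.where(np.arange(n) < n // 2, 1, -1)
    same = np.equal.outer(labels, labels)
    probs = np.where(same, p_in, p_out)
    # Draw the strict upper triangle, then symmetrize (no self-loops).
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)
    return A, labels


def noisy_oracle(labels, reveal_frac, flip_prob, rng):
    """Reveal a fraction of labels, each flipped with prob. flip_prob; 0 = unlabeled."""
    n = labels.size
    revealed = rng.random(n) < reveal_frac
    flipped = rng.random(n) < flip_prob
    s = np.where(revealed, np.where(flipped, -labels, labels), 0)
    return s.astype(float)


def relaxed_ssl_labels(A, s, lam=1.0, tau=None, margin=0.1):
    """Solve (gamma*I - A_tau + lam*P) x = lam*s and return sign(x)."""
    n = A.shape[0]
    if tau is None:
        tau = A.sum() / (n * n)  # assumption: tau = empirical edge density
    A_tau = A - tau * np.ones((n, n))
    P = np.diag((s != 0).astype(float))
    # Assumption: shift just above the top eigenvalue of A_tau so that the
    # quadratic objective is strictly convex and the system is solvable.
    gamma = np.linalg.eigvalsh(A_tau)[-1] + margin
    M = gamma * np.eye(n) - A_tau + lam * P
    x = np.linalg.solve(M, lam * s)
    return np.where(x >= 0, 1, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, labels = sample_two_block_sbm(200, 0.3, 0.05, rng)
    s = noisy_oracle(labels, reveal_frac=0.2, flip_prob=0.2, rng=rng)
    pred = relaxed_ssl_labels(A, s)
    print("accuracy:", np.mean(pred == labels))
```

Setting the gradient of the objective to zero gives the linear system solved above; because `gamma` exceeds the largest eigenvalue of $A_\tau$, the system matrix is positive definite. The noisy labels only anchor the sign and bias the solution, while the graph term dominates, which is why a moderate fraction of flipped labels need not propagate to the output.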
