Probability-density-aware Semi-supervised Learning

Shuyang Liu, Ruiqiu Zheng, Yunhang Shen, Ke Li, Xing Sun, Zhou Yu, Shaohui Lin

TL;DR

This work addresses a gap in semi-supervised learning (SSL): conventional similarity-based neighbor measures underuse cluster structure. It introduces a Probability-Density-Aware Measure (PM) that incorporates density information along paths between data points, and a Probability-Density-Aware Measure Label Propagation algorithm (PMLP) that uses PM to propagate high-confidence pseudo-labels more reliably. The authors provide two theoretical results: a statistical explanation of the cluster assumption (density-path behavior) and a proof that a PM-based label propagation algorithm (LPA) improves over traditional LPA, with pseudo-labeling shown as a special case of PMLP as density influence grows. Empirically, PMLP yields state-of-the-art or near-state-of-the-art performance on multiple benchmarks (SVHN, CIFAR10/100, STL-10), improves pseudo-label quality, and can be accelerated with GPU-based kernel density estimation (KDE).

Abstract

Semi-supervised learning (SSL) assumes that neighboring points lie in the same category (neighbor assumption) and that points in different clusters belong to different categories (cluster assumption). Existing methods usually rely on similarity measures to retrieve similar neighboring points while ignoring the cluster assumption, so they may not use unlabeled information sufficiently and effectively. This paper first provides a systematic investigation into the significant role of probability density in SSL and lays a solid theoretical foundation for the cluster assumption. To this end, we introduce a Probability-Density-Aware Measure (PM) to discern the similarity between neighboring points. To further improve label propagation, we also design a Probability-Density-Aware Measure Label Propagation (PMLP) algorithm that fully considers the cluster assumption. Last but not least, we prove that traditional pseudo-labeling can be viewed as a particular case of PMLP, which provides a comprehensive theoretical understanding of PMLP's superior performance. Extensive experiments demonstrate that PMLP achieves outstanding performance compared with other recent methods.
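To make the idea of a density-aware measure concrete, here is a minimal NumPy sketch. It is not the paper's definition of PM, which this summary does not state. As an assumption for illustration, it scales a Gaussian similarity by the lowest kernel density estimate found on the straight segment between two points, so pairs separated by a low-density gap receive a small affinity. The function names (`gaussian_kde`, `density_aware_affinity`, `propagate_labels`), the bandwidths, and the toy data are all hypothetical.

```python
import numpy as np

def gaussian_kde(query, data, bandwidth=0.5):
    """Gaussian kernel density estimate of `data`, evaluated at each row of `query`."""
    dim = data.shape[1]
    sq_dist = ((query[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dim / 2.0)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1) / norm

def density_aware_affinity(X, n_steps=5, bandwidth=0.5, sigma=1.0):
    """Assumed stand-in for PM: Gaussian similarity times the minimum
    density on the straight segment between the two points."""
    n = X.shape[0]
    ts = np.linspace(0.0, 1.0, n_steps + 2)[1:-1]  # interior points of the segment
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            path = X[i] + ts[:, None] * (X[j] - X[i])
            min_density = gaussian_kde(path, X, bandwidth).min()
            sim = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2.0 * sigma ** 2))
            W[i, j] = W[j, i] = sim * min_density
    return W

def propagate_labels(W, Y0, labeled_mask, n_iter=50):
    """Iterative label propagation that clamps the labeled rows each step."""
    P = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix
    Y = Y0.copy()
    for _ in range(n_iter):
        Y = P @ Y
        Y[labeled_mask] = Y0[labeled_mask]
    return Y

# Toy data: two well-separated 2-D clusters with one labeled point each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(10, 2)),
    rng.normal(loc=(4.0, 0.0), scale=0.3, size=(10, 2)),
])
Y0 = np.zeros((20, 2))
labeled_mask = np.zeros(20, dtype=bool)
Y0[0, 0] = 1.0    # point 0 is labeled as class 0
Y0[10, 1] = 1.0   # point 10 is labeled as class 1
labeled_mask[[0, 10]] = True

W = density_aware_affinity(X)
pred = propagate_labels(W, Y0, labeled_mask).argmax(axis=1)
```

In this toy setup the affinity between two points of the same cluster is far larger than the affinity across the gap, so each labeled point spreads its label mainly within its own cluster. The double loop costs O(n²) density evaluations and is only meant for small examples.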

Paper Structure

This paper contains 30 sections, 91 equations, 6 figures, 8 tables, 3 algorithms.

Figures (6)

  • Figure 1: The procedure of PMLP. First, we select the neighbor points and extract their features. Then, we calculate the densities on the path; the density information is used to construct the Probability-Density-Aware Measure (PM), which fully considers the cluster assumption. Finally, high-confidence predictions are used for pseudo-labeling with an affinity matrix.
  • Figure 2: Left: EMA accuracy of models. Middle: rate of high-quality pseudo-labels. Right: rate of correct high-quality labels.
  • Figure 3: Different colors represent different distances between the target point and neighbor points. Black represents a close distance and a higher affinity. The left panel uses PM as the distance measure, and the right panel uses traditional first-order similarity. PMLP tends to choose neighbors within one cluster, whereas LPA chooses neighbors from different clusters equally.
  • Figure 4: The accuracy, rate of high-quality predictions, and accuracy of pseudo-labels on CIFAR10 with 250 labeled data, CIFAR100 with 2500 labeled data, and SVHN with 250 labeled data. PMLP still produces more correct pseudo-labels, which conforms to our conclusion in the main text.
  • Figure 5: The accuracy and rate of high-quality predictions on STL-10 with 40 and 1000 labeled data. PMLP still produces more correct pseudo-labels, which conforms to our conclusion in the main text.
  • ...and 1 more figures