Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu

TL;DR

Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning and Semi-Supervised Active Learning, even with constrained annotation budgets.

Abstract

Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks because it reduces the need for human labor. Previous studies primarily focus on effectively utilizing labeled and unlabeled data to improve performance. However, we observe that the choice of which samples to label also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has long been under-explored. To fill this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimize a novel criterion, $\alpha$-Maximum Mean Discrepancy ($\alpha$-MMD), RDSS samples a representative and diverse subset of the unlabeled data for annotation. We demonstrate that minimizing $\alpha$-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.
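
To make the idea of selecting a subset by minimizing a kernel discrepancy concrete, the following is a minimal sketch under stated assumptions. It uses the standard squared MMD with a Gaussian kernel and a plain greedy search, not the paper's $\alpha$-MMD criterion or its modified Frank-Wolfe algorithm, whose exact forms are not given in this summary. The names `gaussian_kernel`, `squared_mmd`, and `greedy_select`, the bandwidth, and the toy two-cluster pool are all hypothetical choices for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def squared_mmd(subset, pool, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of the squared MMD between two point sets."""
    def mean_k(a, b):
        return sum(kernel(x, y) for x in a for y in b) / (len(a) * len(b))

    return mean_k(subset, subset) - 2.0 * mean_k(subset, pool) + mean_k(pool, pool)


def greedy_select(pool, budget, kernel=gaussian_kernel):
    """Pick `budget` distinct indices; each step adds the point that yields
    the smallest squared MMD between the chosen subset and the full pool.
    Assumption: a simple greedy stand-in, not the authors' algorithm."""
    chosen = []
    for _ in range(budget):
        best_idx, best_val = None, float("inf")
        for i in range(len(pool)):
            if i in chosen:
                continue
            candidate = [pool[j] for j in chosen + [i]]
            val = squared_mmd(candidate, pool, kernel)
            if val < best_val:
                best_idx, best_val = i, val
        chosen.append(best_idx)
    return chosen


# Toy unlabeled pool: two well-separated 2-D clusters of 20 points each.
random.seed(0)
pool = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(20)]
pool += [(random.gauss(3, 0.3), random.gauss(3, 0.3)) for _ in range(20)]

selected = greedy_select(pool, budget=4)
selected_points = [pool[i] for i in selected]

print("selected indices:", selected)
print("squared MMD (greedy subset):", squared_mmd(selected_points, pool))
print("squared MMD (first 4 points):", squared_mmd(pool[:4], pool))
```

In this toy setting the first four pool points all come from one cluster, so they match the pool distribution poorly, whereas the greedy subset draws from both clusters. That is the sense in which a low discrepancy favors subsets that are both representative and diverse.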

Paper Structure

This paper contains 33 sections, 8 theorems, 46 equations, 5 figures, 6 tables, 2 algorithms.

Key Result

Theorem 5.4

Let $k=k_1^2+k_1k_2+k_3$. Then, under Assumptions 1-3, for any selected samples $S\subset T$, there exists a positive constant $K_c$ such that the following inequality holds: where $0\le\alpha\le 1$, $0\le \max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})=K$ and $\mathbf{X}_S,\mathbf{X}_T$ are projections of $S,T$ on $\mathcal{X}$.

Figures (5)

  • Figure 1: Visualization of selected samples from a dog dataset. The red and grey circles respectively symbolize the selected and unselected samples. a) The selected samples often contain an excessive number of highly similar instances, leading to redundancy; b) The selected samples contain too many edge points, unable to cover the entire dataset; c) The selected samples represent the entire dataset comprehensively and accurately.
  • Figure 2: The performance comparison between GKHR and GKH with different $m,n$ over ten independent runs. The blue line is the mean value of $D$, the red dotted line over (under) the blue line is the mean value of $D$ plus (minus) its standard deviation, and the pink area is the area between the upper and lower red dotted lines.
  • Figure 3: Visualization of selected samples using different sampling methods. Points of different colours represent samples from different classes, while black points indicate the selected samples.
  • Figure 4: Comparison with AL/SSAL approaches on CIFAR-10.
  • Figure 5: Comparison with AL/SSAL approaches on CIFAR-100.

Theorems & Definitions (14)

  • Definition 4.1: Maximum Mean Discrepancy
  • Theorem 5.4
  • Theorem 5.6
  • proof
  • Lemma B.1: Lemma 2 pronzato2021performance
  • Lemma B.2
  • proof
  • Lemma B.3
  • proof
  • Lemma B.4: Proposition 12.31 wainwright2019high
  • ...and 4 more