Confident Sinkhorn Allocation for Pseudo-Labeling

Vu Nguyen, Hisham Husain, Sachin Farfade, Anton van den Hengel

TL;DR

The paper studies the role of uncertainty in pseudo-labeling and proposes Confident Sinkhorn Allocation (CSA), which uses optimal transport to find the best pseudo-label allocation and assigns labels only to samples with high confidence scores. The authors report that CSA outperforms the current state of the art in this practically important area of semi-supervised learning.

Abstract

Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as images and natural language, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. These methods are not applicable, however, when the data does not have the appropriate structure or invariances. Due to their simplicity, pseudo-labeling (PL) methods can be widely used without any domain assumptions. However, the greedy mechanism in PL is sensitive to a threshold and can perform poorly if wrong assignments are made due to overconfidence. This paper studies theoretically the role of uncertainty in pseudo-labeling and proposes Confident Sinkhorn Allocation (CSA), which identifies the best pseudo-label allocation via optimal transport and assigns labels only to samples with high confidence scores. CSA outperforms the current state of the art in this practically important area of semi-supervised learning. Additionally, we propose to use Integral Probability Metrics to extend and improve the existing PAC-Bayes bound, which relies on the Kullback-Leibler (KL) divergence, for ensemble models. Our code is publicly available at https://github.com/amzn/confident-sinkhorn-allocation.
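To make the threshold sensitivity mentioned in the abstract concrete, here is a minimal sketch of greedy pseudo-labeling. It is an illustration under stated assumptions, not the authors' code: the function name `greedy_pseudo_label`, the toy probabilities, and the threshold values are all hypothetical.

```python
import numpy as np

def greedy_pseudo_label(probs, threshold):
    """Greedy PL: keep each sample whose top class probability reaches the threshold."""
    probs = np.asarray(probs, dtype=float)
    top = probs.max(axis=1)                    # confidence of the predicted class
    keep = np.flatnonzero(top >= threshold)    # indices of accepted samples
    return keep, probs[keep].argmax(axis=1)    # accepted indices and their labels

# Hypothetical predicted class probabilities for four unlabeled samples.
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.55, 0.40, 0.05],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
])

results = {}
for gamma in (0.5, 0.8, 0.9):
    idx, labels = greedy_pseudo_label(probs, gamma)
    results[gamma] = (idx.tolist(), labels.tolist())
    print(gamma, results[gamma])
```

The set of accepted samples changes with every threshold, and each sample is decided in isolation from the others. That per-sample, threshold-driven behavior is what the abstract contrasts with a joint allocation.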

Paper Structure

This paper contains 30 sections, 6 theorems, 45 equations, 10 figures, 9 tables, and 1 algorithm.

Key Result

Theorem 1

For $\delta>0$, our estimate satisfies $\sum_{k=1}^K \left| \hat{\theta}_k - \mu_k \right| \le \delta$ with probability at least $1- 2 \sum_{k=1}^K \sum_{j=1}^d \exp \left\{ -\frac{ \delta^2 \tilde{n}_k }{8 \sigma_j^2 } \right\} - \sum_{k=1}^K \frac{4 \operatorname{Var}(I^k)}{ \delta^2} \bigl| \mu_k - \mu_{\dots} \dots$

Figures (10)

  • Figure 1: A depiction of CSA in application. We estimate the ensemble of predictions $P$ on unlabeled data using $M$ classifiers, which can yield different probabilities. We then identify high-confidence samples by performing a T-test. Next, we estimate the label assignment $Q$ using Sinkhorn's algorithm. The cost $C$ of the optimal transport problem is the negative of the probability averaged across classifiers, $C=-\log \bar{P}$. We repeat the process on the remaining unlabeled data as required.
  • Figure 2: T-test for estimating the confidence level on $K=3$ classes. The yellow area indicates high confidence. Samples in the dark area will be excluded (T-test score $\le 2$).
  • Figure 3: Comparison with PL methods
  • Figure 4: Comparison between CSA and pseudo-labeling with different thresholds $\gamma$. Left: the assignments made by CSA cannot be obtained simply by varying the threshold in PL; see, e.g., the locations highlighted by the red squares. Right: Comparison of PL under varying thresholds against CSA.
  • Figure 5: Performance on Digit w.r.t. the number of unlabeled samples given the number of labeled samples as $100,200,500$, respectively.
  • ...and 5 more figures
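
The pipeline described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation (see the repository linked in the abstract for that). Only the cost $C=-\log \bar{P}$ and the T-test cutoff of 2 come from the captions. Everything else is an assumption: the Welch-style statistic between the two highest-scoring classes, the uniform equality marginals `r` and `c`, the regularization `eps`, the function names, and the toy data.

```python
import numpy as np

def welch_t_scores(P):
    """P has shape (M, N, K): M classifiers, N samples, K classes.
    Returns, per sample, a Welch-style t statistic between the two classes
    with the highest mean probability (assumed form of the T-test)."""
    M = P.shape[0]
    mean = P.mean(axis=0)                      # (N, K)
    var = P.var(axis=0, ddof=1)                # (N, K), sample variance over classifiers
    order = np.argsort(-mean, axis=1)
    rows = np.arange(mean.shape[0])
    top, second = order[:, 0], order[:, 1]
    num = mean[rows, top] - mean[rows, second]
    den = np.sqrt((var[rows, top] + var[rows, second]) / M)
    return num / np.maximum(den, 1e-12)

def sinkhorn(C, r, c, eps=0.5, n_iter=200):
    """Entropic optimal transport with equality marginals r (rows) and c (columns)."""
    K = np.exp(-C / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)                      # rescale to match column marginals
        u = r / (K @ v)                        # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Toy ensemble: 4 classifiers, 3 clear samples plus 1 ambiguous sample, 3 classes.
base = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]])
shifts = [-0.02, -0.01, 0.01, 0.02]
clear = np.stack([base + d * (1.5 * np.eye(3) - 0.5) for d in shifts])
ambiguous = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3],
                      [0.4, 0.4, 0.2], [0.3, 0.35, 0.35]])
P = np.concatenate([clear, ambiguous[:, None, :]], axis=1)   # shape (4, 4, 3)

confident = welch_t_scores(P) > 2.0            # cutoff taken from the Figure 2 caption
P_bar = P[:, confident].mean(axis=0)           # average over classifiers
C = -np.log(np.clip(P_bar, 1e-8, None))        # cost from the Figure 1 caption
r = np.full(P_bar.shape[0], 1.0 / P_bar.shape[0])   # assumed uniform mass per sample
c = np.full(P_bar.shape[1], 1.0 / P_bar.shape[1])   # assumed uniform mass per class
Q = sinkhorn(C, r, c)
labels = Q.argmax(axis=1)
print(confident, labels)
```

In this sketch the sample on which the classifiers disagree is filtered out before allocation, and the remaining samples are labeled jointly through `Q` rather than one at a time.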

Theorems & Definitions (9)

  • Theorem 1
  • proof
  • Theorem 2
  • Theorem 3: restated Theorem \ref{theorem_classifier_bound}
  • proof
  • Proposition 1.1: amit2022integral
  • Theorem 4: masegosa2020second
  • Theorem 5: Theorem \ref{main:pac-bayes} in the main paper
  • proof