Improving self-training under distribution shifts via anchored confidence with theoretical guarantees

Taejong Joo, Diego Klabjan

TL;DR

This work builds an uncertainty-aware temporal ensemble using a simple relative thresholding rule and uses it to smooth noisy pseudo labels, promoting selective temporal consistency. It shows that the temporal ensemble is asymptotically correct and that the label smoothing technique can reduce the optimality gap of self-training.

Abstract

Self-training often falls short under distribution shifts due to an increased discrepancy between prediction confidence and actual accuracy. This typically necessitates computationally demanding methods such as neighborhood or ensemble-based label corrections. Drawing inspiration from insights on early learning regularization, we develop a principled method to improve self-training under distribution shifts based on temporal consistency. Specifically, we build an uncertainty-aware temporal ensemble with a simple relative thresholding. This ensemble then smooths noisy pseudo labels to promote selective temporal consistency. We show that our temporal ensemble is asymptotically correct and that our label smoothing technique can reduce the optimality gap of self-training. Our extensive experiments validate that our approach consistently improves self-training performance by 8% to 16% across diverse distribution shift scenarios without computational overhead. In addition, our method exhibits attractive properties, such as improved calibration performance and robustness to different hyperparameter choices.
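
The abstract describes the method only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes that "relative thresholding" means comparing each sample's confidence to a multiple of the mean confidence in the batch, and that smoothing is a convex combination of the hard pseudo label and the ensemble prediction. The names `update_ensemble`, `smoothed_targets`, `rel_threshold`, and `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def update_ensemble(ens_sum, ens_count, probs, rel_threshold=1.0):
    """Add the current predictions to the running ensemble for confident samples.

    Assumption: a sample is "confident" when its max probability exceeds
    rel_threshold times the mean max probability over the batch.
    Arrays are updated in place and also returned.
    """
    conf = probs.max(axis=1)
    keep = conf > rel_threshold * conf.mean()
    ens_sum[keep] += probs[keep]
    ens_count[keep] += 1
    return ens_sum, ens_count


def smoothed_targets(probs, ens_sum, ens_count, lam=0.3):
    """Mix hard pseudo labels with the temporal ensemble where one exists.

    Samples that were never confident keep their hard pseudo label.
    """
    n_classes = probs.shape[1]
    hard = np.eye(n_classes)[probs.argmax(axis=1)]
    has_ens = ens_count > 0
    targets = hard.copy()
    ens_mean = ens_sum[has_ens] / ens_count[has_ens][:, None]
    targets[has_ens] = (1.0 - lam) * hard[has_ens] + lam * ens_mean
    return targets
```

In a training loop, `update_ensemble` would be called once per iteration with the model's softmax outputs on the unlabeled samples, and `smoothed_targets` would supply the training targets in place of one-hot pseudo labels. Because only confident predictions enter the ensemble, temporal consistency is encouraged selectively.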

Paper Structure

This paper contains 39 sections, 6 theorems, 30 equations, 6 figures, 11 tables.

Key Result

Theorem 3.1

Let $A_i(c) := \{x \in \mathcal{X} \mid c(x; \theta_i) > c\}$, $Q(x; \mathbf{c}_{0:m}) := \sum_{i=0}^{m} \mathbf{1}(x \in A_i(c_i))$, and $\bar{p}(x; \mathbf{c}_{0:m}) = \frac{1}{Q(x; \mathbf{c}_{0:m})} \sum_{i=0}^{m} \mathbb{E}_{Y|X=x}[\mathbf{1}(Y(x) = \hat{Y}(x; \theta_i))] \mathbf{1}(x \in A_i(c_i))$. [...] where $\xi(z) := 2z - 1 - \log(2z)$ is a positive increasing function in $z \in [0.5,1]$.
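
The quantities defined above can be computed directly, which may help when reading the statement. The sketch below evaluates $\xi$, the count $Q$, and the average $\bar{p}$ on arrays; it assumes the truncated indicator in $\bar{p}$ is $\mathbf{1}(x \in A_i(c_i))$, matching the definition of $Q$. The function names and array layout (one row per iterate $\theta_i$, one column per sample $x$) are illustrative choices. Note that $\xi(0.5) = 0$, so $\xi$ is strictly positive only for $z > 0.5$, and $\xi'(z) = 2 - 1/z \geq 0$ on $[0.5, 1]$.

```python
import math

import numpy as np


def xi(z):
    """xi(z) = 2z - 1 - log(2z), as defined in the theorem statement."""
    return 2.0 * z - 1.0 - math.log(2.0 * z)


def membership_counts(conf, thresholds):
    """Q(x): number of iterates i whose confidence c(x; theta_i) exceeds c_i.

    conf has shape (m + 1, n); thresholds has shape (m + 1,).
    """
    return (conf > thresholds[:, None]).sum(axis=0)


def ensemble_accuracy(correct_prob, conf, thresholds):
    """p_bar(x): mean per-iterate correctness probability over iterates with x in A_i(c_i).

    correct_prob[i, j] stands for E[1(Y = Y_hat(x_j; theta_i))].
    Returns NaN for samples that are never confident (Q(x) = 0).
    """
    member = conf > thresholds[:, None]
    q = member.sum(axis=0)
    total = (correct_prob * member).sum(axis=0)
    return np.divide(total, q, out=np.full(q.shape, np.nan), where=q > 0)
```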

Figures (6)

  • Figure 1: Section \ref{sec:exp_corruption}: (a) Test accuracy for each intensity level in ImageNet-C. (b) Performance degeneration in the defocus blur corruption with intensity 4. Section \ref{subsec:robust_model_selec}: (c) Maximum performance changes under different model selection methods. We present performances for individual corruptions in the Appendix. For all boxplots used in the paper, the box represents the interquartile range, the whiskers extend to $\pm 1.5$ times the interquartile range, and the horizontal line inside the box represents the median.
  • Figure 2: Sensitivity analysis with respect to $\lambda$ and $\beta$ on four domain pairs (Ar-Pr, Pr-Cl, Rw-Cl, Rw-Pr) in OfficeHome. Here, green triangles are means.
  • Figure 3: (a) ECE under five intensity levels in ImageNet-C; (b) Accuracy and ECE changes during the course of training in VisDA.
  • Figure 4: (a) The accuracy of the generalized temporal ensemble along with the number of confident samples under different degrees of distribution shift. Here, the temporal ensemble is constructed by averaging all predictions over iterations. (b) Average accuracy per number of confident samples over iterations under different thresholding rules.
  • Figure 5: (a) Number of pseudo labels per class for 5,000 training samples in ImageNet-C over 100 training epochs, showing that the marginal distribution of pseudo labels barely changes during training. (b) Changes in the total variation distance between the marginal distributions of the pseudo labels for each pair of consecutive epochs.
  • ...and 1 more figure
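
Two of the metrics named in the captions, the expected calibration error (ECE) in Figure 3 and the total variation distance between pseudo-label marginals in Figure 5(b), have standard definitions that can be sketched as follows. The equal-width binning with 10 bins is a common convention and an assumption here; the captions do not state which binning the paper uses. The function names are illustrative.

```python
import numpy as np


def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over equal-width bins.

    conf holds max-probability confidences in (0, 1]; correct holds 0/1 outcomes.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's sample share
    return ece


def pseudo_label_tv_distance(labels_prev, labels_next, n_classes):
    """Total variation distance between the class marginals of two pseudo-label sets."""
    p = np.bincount(labels_prev, minlength=n_classes) / len(labels_prev)
    q = np.bincount(labels_next, minlength=n_classes) / len(labels_next)
    return 0.5 * np.abs(p - q).sum()
```

Applying `pseudo_label_tv_distance` to the pseudo labels of each pair of consecutive epochs gives a curve of the kind described in Figure 5(b); values near zero indicate that the marginal distribution of pseudo labels is barely changing.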

Theorems & Definitions (11)

  • Theorem 3.1
  • Theorem 3.2
  • Corollary 3.2.1
  • proof
  • proof
  • proof
  • Lemma B.4
  • proof
  • Lemma B.5: Modification from safaryan2024knowledge
  • proof
  • ...and 1 more