Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-Training

Paul Doucet, Benjamin Estermann, Till Aczel, Roger Wattenhofer

TL;DR

This paper tackles the data-labeling bottleneck in active learning by combining diversity-based and uncertainty-based sampling through a simple hybrid, TCM, in the presence of self-supervised pre-training. TCM starts with TypiClust to cover the data distribution and then switches to Margin to refine decision boundaries, guided by a transition-point heuristic and a robust step-size rule. Across CIFAR10/100 and ISIC2019, including long-tail variants, TCM consistently outperforms its constituent methods and other baselines, leveraging pre-trained backbones to enable early transition. The approach provides practical, data-budget-agnostic guidelines for practitioners and demonstrates that a lightweight hybrid strategy can yield strong, stable performance across diverse data regimes. The work highlights the value of self-supervised representations in simplifying active learning dynamics and reducing reliance on complex switching mechanisms.

Abstract

This study addresses the integration of diversity-based and uncertainty-based sampling strategies in active learning, particularly within the context of self-supervised pre-trained models. We introduce a straightforward heuristic called TCM that mitigates the cold start problem while maintaining strong performance across various data levels. By initially applying TypiClust for diversity sampling and subsequently transitioning to uncertainty sampling with Margin, our approach effectively combines the strengths of both strategies. Our experiments demonstrate that TCM consistently outperforms existing methods across various datasets in both low and high data regimes.
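
The two-stage selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `n_typiclust_rounds` parameter, and the assumption that cluster assignments and typicality scores are precomputed from the self-supervised embeddings are all hypothetical. The diversity step is a simplified TypiClust-style rule (take the most typical point from the least-covered cluster), and the uncertainty step is plain Margin sampling (smallest gap between the two highest class probabilities).

```python
from collections import Counter


def margin_scores(probs):
    """Margin = p(top-1) - p(top-2) per sample; a small margin means high uncertainty."""
    scores = []
    for p in probs:
        top = sorted(p, reverse=True)
        scores.append(top[0] - top[1])
    return scores


def select_margin(probs, unlabeled, k):
    """Pick the k unlabeled indices with the smallest margin."""
    scores = margin_scores([probs[i] for i in unlabeled])
    ranked = sorted(zip(scores, unlabeled))
    return [i for _, i in ranked[:k]]


def select_typical(typicality, cluster_of, labeled, unlabeled, k):
    """Simplified TypiClust-style pick (assumption: clusters and typicality
    are precomputed). Visit the least-covered cluster first and take its
    most typical unlabeled point."""
    covered = Counter(cluster_of[i] for i in labeled)
    pool = set(unlabeled)
    picked = []
    for _ in range(k):
        if not pool:
            break
        by_cluster = {}
        for i in pool:
            by_cluster.setdefault(cluster_of[i], []).append(i)
        # Fewest labeled points first; ties go to the larger cluster, then the lower id.
        target = min(by_cluster, key=lambda c: (covered[c], -len(by_cluster[c]), c))
        best = max(by_cluster[target], key=lambda i: (typicality[i], -i))
        picked.append(best)
        pool.remove(best)
        covered[target] += 1
    return picked


def tcm_select(round_idx, n_typiclust_rounds, probs, typicality,
               cluster_of, labeled, unlabeled, k):
    """Hybrid rule: diversity sampling for the first rounds, Margin afterwards.
    n_typiclust_rounds is a hypothetical stand-in for the transition point."""
    if round_idx < n_typiclust_rounds:
        return select_typical(typicality, cluster_of, labeled, unlabeled, k)
    return select_margin(probs, unlabeled, k)
```

In a full active learning loop, `probs` would come from the classifier trained on the current labeled set, and the returned indices would be sent for annotation before the next round.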

Paper Structure (13 sections, 4 figures)


Figures (4)

  • Figure 1: Accuracy improvement compared to random for all baselines and our (TCM) strategy. The mean and standard deviation of the accuracy improvement are computed over all budget sizes for CIFAR10, CIFAR100 and ISIC2019.
  • Figure 2: Transition point ablation on the CIFAR10 dataset. Switching to Margin only in the last step corresponds to $N=10$, while using TypiClust only for the initial sampling and switching to Margin immediately afterwards corresponds to $N=1$.
  • Figure 3: Step size ablation for TCM on the CIFAR10 dataset. For each regime, we evaluate three different step sizes $S$. Overall, there is no clear performance difference between the step sizes.
  • Figure 4: Accuracy improvement compared to random for all baselines and our (TCM) strategy. The mean accuracy improvement is computed over all 4 budget sizes: tiny, small, medium, and large. Standard deviation is aggregated with respect to the random seed. The top row shows the main evaluated datasets, while the bottom row shows an ablation on the imbalanced versions of CIFAR10 and CIFAR100. For all imbalanced datasets, the reported accuracy is balanced, i.e. computed as the average of the per-class recall. TCM shows consistently strong performance on all datasets, even those on which TypiClust or Margin alone perform suboptimally. Coreset shows strong performance on the LT datasets. Unfortunately, this performance does not transfer to the real-life imbalanced dataset ISIC2019.
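
The balanced accuracy used for the imbalanced datasets in Figure 4 is the average of the per-class recall. The sketch below shows that computation on plain label lists; the function name is illustrative and not taken from the paper's code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so every class counts equally
    regardless of how many samples it has."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)
```

On a long-tail test set, a classifier that always predicts the majority class can reach high plain accuracy while its balanced accuracy stays near chance, which is why the per-class average is the fairer measure here.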