Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-Training
Paul Doucet, Benjamin Estermann, Till Aczel, Roger Wattenhofer
TL;DR
This paper tackles the data-labeling bottleneck in active learning by combining diversity-based and uncertainty-based sampling in a simple hybrid, TCM, applied on top of self-supervised pre-training. TCM starts with TypiClust to cover the data distribution and then switches to Margin to refine decision boundaries, guided by a transition-point heuristic and a robust step-size rule. Across CIFAR10/100 and ISIC2019, including long-tail variants, TCM consistently outperforms its constituent methods and other baselines; the pre-trained backbone is what makes an early transition possible. The approach gives practitioners guidelines that hold across labeling budgets and shows that a lightweight hybrid strategy can deliver strong, stable performance across diverse data regimes. The work highlights the value of self-supervised representations in simplifying active learning dynamics and reducing reliance on complex switching mechanisms.
Abstract
This study addresses the integration of diversity-based and uncertainty-based sampling strategies in active learning, particularly within the context of self-supervised pre-trained models. We introduce a straightforward heuristic called TCM that mitigates the cold start problem while maintaining strong performance across various data levels. By initially applying TypiClust for diversity sampling and subsequently transitioning to uncertainty sampling with Margin, our approach effectively combines the strengths of both strategies. Our experiments demonstrate that TCM consistently outperforms existing methods across various datasets in both low and high data regimes.
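The two-phase selection described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes embeddings and cluster assignments are already available (for example from a self-supervised backbone followed by a clustering step), uses a simplified TypiClust-style rule (most typical unlabeled point from the least-covered cluster, with typicality taken as the inverse mean distance to the k nearest neighbours), and takes the switch point as a hypothetical `switch_round` argument instead of the paper's transition heuristic. All function and parameter names are illustrative.

```python
import math


def margin_scores(probs):
    """Margin = top-1 minus top-2 class probability; a small margin means high uncertainty."""
    scores = []
    for p in probs:
        top = sorted(p, reverse=True)
        scores.append(top[0] - top[1])
    return scores


def select_by_margin(probs, unlabeled, batch_size):
    """Pick the unlabeled indices with the smallest margins."""
    scores = margin_scores([probs[i] for i in unlabeled])
    ranked = sorted(zip(scores, unlabeled))
    return [idx for _, idx in ranked[:batch_size]]


def typicality(embeddings, k):
    """Inverse mean distance to the k nearest neighbours (higher = more typical)."""
    result = []
    for i, x in enumerate(embeddings):
        dists = sorted(math.dist(x, y) for j, y in enumerate(embeddings) if j != i)
        nearest = dists[:k]
        result.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return result


def select_by_typicality(embeddings, cluster_ids, labeled, unlabeled, batch_size, k=2):
    """Simplified TypiClust-style step (assumption: clusters are given as input)."""
    typ = typicality(embeddings, k)
    labeled_per_cluster = {c: 0 for c in set(cluster_ids)}
    for i in labeled:
        labeled_per_cluster[cluster_ids[i]] += 1
    pool = set(unlabeled)
    chosen = []
    while len(chosen) < batch_size and pool:
        # Least-covered cluster that still has unlabeled members.
        candidates = {cluster_ids[i] for i in pool}
        target = min(candidates, key=lambda c: (labeled_per_cluster[c], c))
        # Most typical unlabeled point in that cluster (ties go to the lower index).
        best = max((i for i in pool if cluster_ids[i] == target),
                   key=lambda i: (typ[i], -i))
        chosen.append(best)
        pool.remove(best)
        labeled_per_cluster[target] += 1
    return chosen


def tcm_select(round_idx, switch_round, embeddings, cluster_ids, probs,
               labeled, unlabeled, batch_size):
    """Diversity sampling before the (hypothetical) switch round, Margin afterwards."""
    if round_idx < switch_round:
        return select_by_typicality(embeddings, cluster_ids, labeled,
                                    unlabeled, batch_size)
    return select_by_margin(probs, unlabeled, batch_size)
```

In an active learning loop, `tcm_select` would be called once per round with the current model's softmax outputs as `probs`; the returned indices are sent for labeling and moved from `unlabeled` to `labeled` before retraining.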
