Revisiting Active Learning in the Era of Vision Foundation Models
Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J. Nirschl, Serena Yeung-Levy
TL;DR
This paper re-examines active learning (AL) for image classification under limited labeling budgets, now that large vision foundation models such as DINOv2 and OpenCLIP are available. It analyzes four pillars of AL (initial pool selection, diversity, representative versus uncertainty sampling, and leveraging unlabeled data) using frozen backbone embeddings with a linear classifier. The authors find that centroid-based initialization improves cold-start performance, that uncertainty-based querying remains strong in low-budget settings, and that simple diversity mechanisms suffice, while semi-supervised label propagation offers limited, dataset-dependent gains. They propose DropQuery, which combines centroid initialization, dropout-driven uncertainty estimation with $M=3$, and clustering to select a diverse, informative batch of size $B$ per iteration. They report that it outperforms competing AL strategies across natural, cross-domain biomedical, and large-scale datasets, with an efficient offline implementation. The authors release public code and argue that foundation-model embeddings warrant a shift in AL design, enabling scalable querying in diverse real-world scenarios.
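The selection steps summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The summary does not say how dropout is applied or how disagreement is thresholded, so the sketch makes these choices:

- $M$ is read as the number of dropout passes.
- Dropout is applied to the frozen input embeddings.
- A sample counts as uncertain when more than half of its $M$ dropout predictions differ from the clean prediction.

The function names, the dropout rate `drop_p=0.5`, and the hand-rolled k-means are hypothetical choices made for illustration.

```python
import numpy as np

def sq_dists(x, c):
    """Squared Euclidean distances between rows of x and rows of c."""
    return ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        assign = sq_dists(x, centroids).argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(0)
    return centroids

def nearest_to_centroids(x, k, seed=0):
    """Cluster x into k groups; return one distinct row index per centroid."""
    d = sq_dists(x, kmeans(x, k, seed=seed))
    chosen = []
    for j in range(k):
        # closest point to centroid j that has not been picked already
        pick = next(i for i in np.argsort(d[:, j]) if i not in chosen)
        chosen.append(int(pick))
    return np.array(chosen)

def centroid_init(feats, budget, seed=0):
    """Cold start: label the samples closest to k-means centroids."""
    return nearest_to_centroids(feats, budget, seed)

def dropout_query(feats, weights, bias, batch_size, m=3, drop_p=0.5, seed=0):
    """Pick batch_size rows of the unlabeled pool `feats` for labeling.

    weights/bias define a linear classifier on frozen embeddings.
    Assumption: a sample is a candidate when more than half of its m
    dropout predictions disagree with the prediction on clean features.
    """
    rng = np.random.default_rng(seed)
    base_pred = (feats @ weights + bias).argmax(1)
    flips = np.zeros(len(feats), dtype=int)
    for _ in range(m):
        keep = rng.random(feats.shape) >= drop_p
        dropped = feats * keep / (1.0 - drop_p)  # inverted dropout scaling
        flips += (dropped @ weights + bias).argmax(1) != base_pred
    candidates = np.flatnonzero(flips > m / 2)
    if len(candidates) <= batch_size:
        # too few uncertain samples: fall back to the most-flipped ones
        return np.argsort(-flips, kind="stable")[:batch_size]
    # diversity step: one candidate per cluster of the uncertain set
    return candidates[nearest_to_centroids(feats[candidates], batch_size, seed)]
```

In an AL loop, `centroid_init` would supply the first labeled pool, the linear classifier would be refit on the labeled embeddings, and `dropout_query` would be called on the remaining unlabeled embeddings at each round. Because the backbone is frozen, the embeddings can be computed once and reused across rounds.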
Abstract
Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for active learning (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, specifically in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new simple and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. We also provide a highly performant and efficient implementation of modern AL strategies (including our method) at https://github.com/sanketx/AL-foundation-models.
