Revisiting Active Learning in the Era of Vision Foundation Models

Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J. Nirschl, Serena Yeung-Levy

TL;DR

This paper tackles labeling efficiency in image classification under limited budgets by re-examining active learning (AL) in the era of large vision foundation models (e.g., $f$ = DINOv2 or OpenCLIP). It analyzes four AL pillars (initial pool selection, diversity, representative versus uncertainty sampling, and leveraging unlabeled data) using frozen backbone embeddings with a linear classifier. The authors find that centroid-based initialization improves cold-start performance, uncertainty-based querying remains strong in low-budget settings, and simple diversity mechanisms suffice, while semi-supervised label propagation offers limited, dataset-dependent gains. They propose DropQuery, which combines centroid initialization, dropout-driven uncertainty with $M=3$, and clustering to select a diverse, informative batch of size $B$ per iteration, and they report strong results across natural, cross-domain biomedical, and large-scale datasets with an efficient offline implementation. The work provides public code and argues that foundation-model embeddings warrant a shift in AL design, enabling scalable querying in diverse real-world scenarios.
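The DropQuery recipe summarized above (centroid initialization, dropout-driven uncertainty, then clustering for diversity) can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names, the dropout rate of 0.75, the "more than half of the $M$ dropout predictions disagree" candidate rule, and the fallback for small candidate sets are all assumptions made for this example. Here `feats` stands for frozen backbone embeddings of the unlabeled pool, and `weights` and `bias` for a linear classifier trained on the current labeled set.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n_points, k).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # an emptied cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids


def nearest_to_centroids(x, centroids):
    """Indices of the points closest to each centroid, deduplicated."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return np.unique(dists.argmin(axis=0))


def centroid_init(feats, batch_size, seed=0):
    """Cold-start pool: one representative point per k-means cluster."""
    return nearest_to_centroids(feats, kmeans(feats, batch_size, seed=seed))


def drop_query_sketch(feats, weights, bias, batch_size, m=3, drop_p=0.75, seed=0):
    """Return up to batch_size pool indices that are uncertain and diverse.

    ASSUMPTIONS (not taken from the summary): drop_p, the majority-disagreement
    candidate rule, and the fallback when too few candidates are found.
    """
    rng = np.random.default_rng(seed)
    base_pred = (feats @ weights + bias).argmax(axis=1)
    flips = np.zeros(len(feats), dtype=int)
    for _ in range(m):
        # Inverted dropout applied to the input features.
        keep = rng.random(feats.shape) >= drop_p
        dropped = feats * keep / (1.0 - drop_p)
        flips += (dropped @ weights + bias).argmax(axis=1) != base_pred
    # Uncertain candidates: prediction changed in more than half of the m passes.
    candidates = np.flatnonzero(flips > m / 2)
    if len(candidates) < batch_size:
        # Hypothetical fallback: take the points whose predictions flipped most.
        return np.argsort(-flips, kind="stable")[:batch_size]
    # Diversity: cluster the candidates and keep one point per cluster.
    centroids = kmeans(feats[candidates], batch_size, seed=seed)
    return candidates[nearest_to_centroids(feats[candidates], centroids)]
```

Under these assumptions, `centroid_init(feats, B)` would supply the first labeled batch, and each later iteration would call `drop_query_sketch` with the refit classifier. The pairwise-distance computation materializes an array of shape (points, clusters, dimensions), so a large pool would need a chunked or library-backed k-means instead.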

Abstract

Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for active learning (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, specifically in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new simple and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. We also provide a highly performant and efficient implementation of modern AL strategies (including our method) at https://github.com/sanketx/AL-foundation-models.

Paper Structure (24 sections, 1 equation, 5 figures, 10 tables, 1 algorithm)

Figures (5)

  • Figure 1: Results of our AL strategy on different representation spaces (i.e., DINOv2 and OpenCLIP). The y-axis is the delta in accuracy between iteration $i$ and $i/2$. In early iterations, the improvements to AL query performance are more pronounced for larger models.
  • Figure 2: (Top row) Results on the fine-grained natural image classification datasets Stanford Cars, FGVC Aircraft, Oxford-IIIT Pets, and Places365. (Bottom row) AL curves for biomedical datasets, including images of peripheral blood smears, retinal fundoscopy, HeLa cell structures, and skin dermoscopy, covering the pathology, ophthalmology, cell biology, and dermatology domains with various imaging modalities. Additional biomedical datasets are explored in the Appendix.
  • Figure 3: Win matrices for all AL strategies investigated in our study, evaluated on natural image datasets and out-of-domain biomedical image datasets. (a) CIFAR100, Food101, ImageNet-100, and DomainNet-Real using DINOv2 ViT-g/14 features and Stanford Cars, FGVC Aircraft, Oxford-IIIT Pets, and Places365 using OpenCLIP ViT-G/14 features (8 total settings). Due to computational costs, ProbCover was not evaluated on Places365, so the max value of cells in the ProbCover row/column is 7. (b) Blood Smear, Diabetic Retinopathy, IICBU HeLa, and Skin Cancer datasets using DINOv2 ViT-g/14 features (4 total settings). DropQuery outperforms all other methods on the natural image datasets and is a strong competitor to all other methods on the biomedical image datasets, with statistical significance.
  • Figure 4: We illustrate the performance difference $\Delta_{ssl}$ between AL with and without label propagation for unlabeled instances. The results, averaged over 5 runs of 20 AL iterations on 4 natural image datasets, show that pseudo-labeling with foundation-model features, although significantly beneficial in the initial iterations of AL, hurts the performance of the active learner in later iterations.
  • Figure 5: Full AL curves for additional out-of-domain biomedical datasets.
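
As a rough illustration of how a win matrix like those in Figure 3 could be tallied, the sketch below counts, for each ordered pair of strategies, the number of settings in which the first beats the second. The caption says only "with statistical significance", so the test used here (a one-sided check on a paired t statistic over 5 runs, using the two-sided 5% critical value) is an assumption, as are the function names and the layout of `results`. Settings in which a strategy was not evaluated are skipped, which is why a strategy missing from one of 8 settings can score at most 7.

```python
import numpy as np

# Two-sided 5% critical value of Student's t with 4 degrees of freedom,
# i.e. a paired test over 5 runs. ASSUMPTION: the source does not name the test.
T_CRIT_DF4 = 2.776


def beats(acc_a, acc_b, t_crit=T_CRIT_DF4):
    """True if strategy A's per-run accuracies significantly exceed B's."""
    diff = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    std_err = diff.std(ddof=1) / np.sqrt(len(diff))
    if std_err == 0.0:  # identical gap in every run
        return bool(diff.mean() > 0)
    return bool(diff.mean() / std_err > t_crit)


def win_matrix(results, strategies):
    """results[setting][strategy] -> per-run accuracies (hypothetical layout).

    Cell (i, j) counts the settings in which strategies[i] beats strategies[j].
    """
    n = len(strategies)
    wins = np.zeros((n, n), dtype=int)
    for per_setting in results.values():
        for i, a in enumerate(strategies):
            for j, b in enumerate(strategies):
                # Skip self-comparisons and strategies missing from this setting.
                if i == j or a not in per_setting or b not in per_setting:
                    continue
                wins[i, j] += beats(per_setting[a], per_setting[b])
    return wins
```

The hard-coded critical value only fits 5 paired runs; a different number of runs, or a different test, would need a different threshold.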