Data Efficient Contrastive Learning in Histopathology using Active Sampling
Tahsin Reasat, Asif Sushmit, David S. Smith
TL;DR
This work addresses the data inefficiency of contrastive learning in histopathology by combining active sampling with a lightweight proxy model. It extends SimCLR with an iterative loop in which the proxy selects the most informative unlabeled samples, using uncertainty or coreset strategies, so that fewer training samples are needed while feature quality is maintained. On the Kather-19 dataset, the method reaches the same performance with up to $93\%$ fewer samples and up to $62\%$ less training time, which makes self-supervised pretraining more practical for high-resolution pathology images. The approach focuses learning on tumor-relevant examples and provides a scalable framework for efficient self-supervised learning (SSL) in medical imaging.
Abstract
Deep learning (DL) based diagnostic systems can provide accurate and robust quantitative analysis in digital pathology. These algorithms require large amounts of annotated training data, which are impractical to obtain in pathology because of the high resolution of histopathological images. Hence, self-supervised methods have been proposed to learn features using ad-hoc pretext tasks. Self-supervised training uses a large unlabeled dataset, which makes the learning process time consuming. In this work, we propose a new method for actively sampling informative members of the training set using a small proxy network, decreasing the sample requirement by 93% and the training time by 62% while matching the performance of the traditional self-supervised learning method. The code is available at https://github.com/Reasat/data_efficient_cl
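
To make the sampling step concrete, the following is a minimal sketch of one common coreset strategy, k-center greedy (farthest-point) selection, applied to feature vectors such as a proxy network might produce. It is an illustration under stated assumptions, not the authors' implementation: the function name `k_center_greedy`, its arguments, and the use of Euclidean distance are hypothetical choices, and the selection rule in the released code may differ.

```python
import numpy as np


def k_center_greedy(embeddings, selected, budget):
    """Return up to `budget` new indices chosen by farthest-point selection.

    embeddings: (n, d) array of feature vectors (assumed to come from a proxy model).
    selected:   non-empty list of indices that are already in the training pool.
    budget:     number of additional samples to pick in this round.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    selected = list(selected)
    if not selected:
        raise ValueError("at least one initial sample is required")

    # Distance from every sample to its nearest already-selected sample.
    diffs = embeddings[:, None, :] - embeddings[selected][None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    min_dist[selected] = -1.0  # never pick a sample twice

    budget = min(budget, len(embeddings) - len(set(selected)))
    picked = []
    for _ in range(budget):
        # The sample farthest from the current pool is the least well covered.
        idx = int(np.argmax(min_dist))
        picked.append(idx)
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
        min_dist[idx] = -1.0
    return picked


# Toy example: five points on a line, starting from the first one.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
new_indices = k_center_greedy(points, selected=[0], budget=2)
```

In an iterative loop of the kind described above, a step like this would be repeated each round: embed the unlabeled pool with the proxy, add the selected indices to the training subset, and continue contrastive training on the enlarged subset.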
