
Data Efficient Contrastive Learning in Histopathology using Active Sampling

Tahsin Reasat, Asif Sushmit, David S. Smith

TL;DR

This work addresses the data inefficiency of contrastive learning in histopathology by combining active sampling with a lightweight proxy model. It extends SimCLR with an iterative loop in which the proxy selects the most informative unlabeled samples, using uncertainty or coreset strategies, so that fewer training samples are needed while feature quality is maintained. On the Kather-19 dataset, the method reaches the same performance as standard self-supervised training with up to $93\%$ fewer samples and up to $62\%$ less training time, which makes it more practical for high-resolution pathology settings. The approach focuses learning on tumor-relevant examples and provides a scalable framework for efficient self-supervised learning (SSL) in medical imaging.

Abstract

Deep learning (DL) based diagnostic systems can provide accurate and robust quantitative analysis in digital pathology. These algorithms require large amounts of annotated training data, which is impractical to obtain in pathology because of the high resolution of histopathological images. Hence, self-supervised methods have been proposed to learn features using ad-hoc pretext tasks. The self-supervised training process uses a large unlabeled dataset, which makes learning time-consuming. In this work, we propose a new method for actively sampling informative members from the training set using a small proxy network, decreasing the sample requirement by 93% and training time by 62% while maintaining the same performance as the traditional self-supervised learning method. The code is available at https://github.com/Reasat/data_efficient_cl
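
The uncertainty-based selection step mentioned above can be illustrated with a minimal sketch. It assumes the proxy network outputs a class-probability vector for each unlabeled sample and that "uncertainty" is measured by Shannon entropy; the function names and the toy probabilities are hypothetical and are not taken from the authors' repository.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool samples with the highest predictive entropy."""
    scores = [entropy(p) for p in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical proxy outputs for four unlabeled patches over three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
]
selected = select_most_uncertain(pool_probs, k=2)
```

In an iterative loop of this kind, the selected indices would be added to the training subset before the next round of contrastive training; how the proxy is retrained between rounds is not specified here.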

Paper Structure

This paper contains 16 sections, 6 equations, 8 figures, 1 table, and 2 algorithms.

Figures (8)

  • Figure 1: Active learning loop (Settles, 2009). An oracle annotates the most informative samples, which are then used to refine model predictions.
  • Figure 2: (a) The SimCLR framework. A neural network model $f(\cdot)$ minimizes the distance (maximizes agreement) between the feature representations of two augmented views $\tilde{x}_i$ and $\tilde{x}_j$ of the same image. (b) The proposed framework speeds up the contrastive sampling process by actively selecting informative samples with the help of a simple proxy model. (c) Structure of the simple proxy model, which is a fully connected network with a depth of one.
  • Figure 3: The tissue types present in the Kather-19 dataset.
  • Figure 4: Comparison of sampling strategies. The active sampling methods (uncertainty and coreset) required fewer samples to reach the performance of the CL model trained on the full set of images (benchmark).
  • Figure 5: The change in average entropy at each iteration across different tissue types. Active sampling decreased the average entropy of the tumor samples with each iteration.
  • ...and 3 more figures
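
The coreset strategy compared in Figure 4 is not spelled out in this summary. As an illustration only, the sketch below uses k-center greedy selection, one common coreset formulation; whether the paper uses this exact variant is an assumption, and the function name, the seed set, and the toy 2-D feature vectors are hypothetical.

```python
import math
from typing import List, Sequence


def k_center_greedy(
    features: Sequence[Sequence[float]], seed: Sequence[int], k: int
) -> List[int]:
    """Greedily pick k indices, each the point farthest from all chosen centers.

    `seed` holds the indices already in the training subset and must be non-empty.
    """
    # Distance from every point to its nearest already-selected center.
    min_dist = [min(math.dist(f, features[s]) for s in seed) for f in features]
    picked: List[int] = []
    for _ in range(k):
        idx = max(range(len(features)), key=lambda i: min_dist[i])
        picked.append(idx)
        # A new center can only shrink each point's nearest-center distance.
        min_dist = [
            min(d, math.dist(f, features[idx])) for d, f in zip(min_dist, features)
        ]
    return picked


# Hypothetical 2-D feature vectors; sample 0 is already in the training subset.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
picked = k_center_greedy(features, seed=[0], k=2)
```

Unlike entropy-based selection, this criterion ignores the proxy's predictions and instead favors samples that cover regions of feature space not yet represented in the selected subset.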