Shifting to Machine Supervision: Annotation-Efficient Semi and Self-Supervised Learning for Automatic Medical Image Segmentation and Classification

Pranav Singh; Raviteja Chukkapalli; Shravan Chaudhari; Luoyao Chen; Mei Chen; Jinqian Pan; Craig Smuda; Jacopo Cirrone

TL;DR

The paper tackles annotation bottlenecks in medical imaging by proposing the S4MI pipeline, which combines self-supervised and semi-supervised learning for annotation-efficient classification and segmentation. It demonstrates that self-supervised pretraining (notably CASS) surpasses supervised transfer learning for classification, while semi-supervised segmentation that also learns from unlabeled data achieves higher IoU than fully supervised methods while using 50% fewer labels. Across three medical imaging datasets, the approach achieves robust performance with reduced labeling effort, and the code is open source for reproducibility. This work advances practical machine supervision in healthcare by reducing labeling costs while maintaining or improving accuracy.

Abstract

Advancements in clinical treatment are increasingly constrained by the limitations of supervised learning techniques, which depend heavily on large volumes of annotated data. The annotation process is not only costly but also demands substantial time from clinical specialists. Addressing this issue, we introduce the S4MI (Self-Supervision and Semi-Supervision for Medical Imaging) pipeline, a novel approach that leverages advancements in self-supervised and semi-supervised learning. These techniques engage in auxiliary tasks that do not require labeling, thus simplifying the scaling of machine supervision compared to fully-supervised methods. Our study benchmarks these techniques on three distinct medical imaging datasets to evaluate their effectiveness in classification and segmentation tasks. Notably, we observed that self-supervised learning significantly surpassed the performance of supervised methods in classification on all evaluated datasets. Remarkably, the semi-supervised approach demonstrated superior outcomes in segmentation, outperforming fully-supervised methods while using 50% fewer labels across all datasets. In line with our commitment to contributing to the scientific community, we have made the S4MI code openly accessible, allowing for broader application and further development of these methods.

Paper Structure (22 sections, 3 equations, 5 figures, 1 table)

Introduction
Data
Methods
Classification
Classification pipeline additional details
Segmentation
Approach specific additional details
Semi-Supervised Approach:
Unsupervised Approach:
Supervised Approach:
Common Implementations:
Data-Preprocessing
Results
Classification
Saliency Maps
...and 7 more sections

Figures (5)

Figure 1: Applying CASS. This figure details the steps involved in applying CASS, the best-performing machine supervision approach for classification [singh2022cass]. We conducted experiments on three datasets, listed on the left side: the Dermatomyositis, ISIC-2017, and Dermofit datasets. We train on one dataset at a time and initialize both networks with their ImageNet weights. Training CASS starts with label-free pre-training, illustrated in Part (b) of this figure, during which a CNN and a Transformer are trained simultaneously. This pre-training is followed by labeled fine-tuning, shown in Part (c), where the networks are fine-tuned one at a time. For the Dermatomyositis dataset the fine-tuning is multi-label, while for the ISIC-2017 and Dermofit datasets it is multi-class.
Figure 2: Visualization of saliency maps on a random sample from the ISIC-2017 dataset. Left (a, b): input image; middle (c, d): saliency map from CASS; right (e, f): saliency map from DINO, with ViT-B/16 at the top and ResNet-50 at the bottom. DINO’s saliency maps exhibit notable stochasticity and lack a strong correlation with the specific pathology under consideration. Conversely, the CASS saliency maps are significantly better aligned with the pathology of interest, for both the CNN and the Transformer.
Figure 3: The semi-supervised pipeline described in the Semi-Supervised Approach section. As in the classification experiments, we evaluate the segmentation pipeline on three challenging medical image segmentation datasets: the Dermatomyositis, ISIC-2017, and Dermofit datasets. We use one dataset at a time to train the semi-supervised architecture. Unlike the classification pipeline, semi-supervised learning involves simultaneous learning from labeled and unlabeled data. In Part (b) of this figure, unlabeled data is passed to the architecture, and during the same iteration labeled data is also passed to it, as shown in Part (c). Passing the labeled images through the architecture yields the predictions shown in Part (e). The unsupervised loss is calculated by comparing the outputs of the CNN and the Transformer (Part (d)) using $\mathcal{L}_{\text{Unsupervised}}$ from the Semi-Supervised Approach section. This unsupervised loss is then added to the supervised loss, denoted by $\mathcal{L}_{\text{Supervised}}$ in the same section. The supervised loss is calculated against the ground truth, as shown in Parts (e) and (f).
Figure 4: The unsupervised segmentation approach, the PiCIE pipeline [picie2021]. $View_1$ and $View_2$ represent two photometrically transformed views of the input image, whereas $View_2^{1}$ represents the geometric transformation of $View_2$. Cross-view training is then used to train the architecture shared between the two views (parameterized by $\theta$ in the figure); this is expanded on in the Unsupervised Approach section.
Figure 5: Results for the Dermatomyositis, ISIC-2017, and Dermofit datasets in panels (a), (b), and (c), respectively. We compare the segmentation performance of the fully-supervised (Full), semi-supervised (Semi), and unsupervised architectures across these datasets at different label fractions (x-axis). Performance is measured by Intersection over Union (IoU), shown on the y-axis, which ranges from 0 to 1 and allows comparison across all three datasets. The blue bar represents the performance of the unsupervised approach, PiCIE [picie2021], which, by definition, does not require any labels for fine-tuning; its results are therefore reported at a 0% label fraction. Remarkably, the semi-supervised approach surpasses the fully-supervised approach while requiring 50% fewer labels across all three datasets.
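
As an illustration of the IoU metric used in Figure 5, the following is a minimal sketch of how IoU can be computed for a pair of binary segmentation masks. It is not taken from the S4MI code: the function name `iou`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def iou(pred, target):
    """Intersection over Union for two same-shaped binary masks.

    Masks are nested lists of 0/1 values (a hypothetical representation
    chosen to keep the example dependency-free).
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels marked foreground in both masks
    union = 0         # pixels marked foreground in at least one mask
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```

The value lies between 0 (no overlap) and 1 (identical masks), matching the y-axis range described in the caption.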