Systematic comparison of semi-supervised and self-supervised learning for medical image classification

Zhe Huang, Ruijie Jiang, Shuchin Aeron, Michael C. Hughes

TL;DR

This study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets and provides valuable best practices to resource-constrained practitioners: hyperparameter tuning is effective, and the semi-supervised method known as MixMatch delivers the most reliable gains across 4 datasets.

Abstract

In typical medical image classification problems, labeled data is scarce while unlabeled data is more available. Semi-supervised learning and self-supervised learning are two different research directions that can improve accuracy by learning from extra unlabeled data. Recent methods from both directions have reported significant gains on traditional benchmarks. Yet past benchmarks do not focus on medical tasks and rarely compare self- and semi- methods together on an equal footing. Furthermore, past benchmarks often handle hyperparameter tuning suboptimally. First, they may not tune hyperparameters at all, leading to underfitting. Second, when tuning does occur, it often unrealistically uses a labeled validation set that is much larger than the training set. Therefore, currently published rankings might not always reflect practical utility. This study contributes a systematic evaluation of self- and semi- methods with a unified experimental protocol intended to guide a practitioner with scarce overall labeled data and a limited compute budget. We answer two key questions: Can hyperparameter tuning be effective with realistic-sized validation sets? If so, when all methods are tuned well, which self- or semi-supervised methods achieve the best accuracy? Our study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets. From 20000+ GPU hours of computation, we provide valuable best practices to resource-constrained practitioners: hyperparameter tuning is effective, and the semi-supervised method known as MixMatch delivers the most reliable gains across 4 datasets.

Paper Structure (36 sections, 3 equations, 6 figures, 5 tables, 1 algorithm)


Figures (6)

  • Figure 1: Balanced accuracy over time profiles of semi- and self-supervised methods across all 4 datasets. At each time, we report the test set bal. acc. of each method (mean over 5 trials of Alg. \ref{alg:hyperparam_tuning}). Best viewed electronically. Top Row: On these 3 datasets, we compare all 6 semi- and 7 self- methods (see legend in lower right, citations in Sec. \ref{methods}) to 3 labeled-set-only baselines. Bottom Row: On the larger AIROGS dataset, we compare selected methods representative of the best in the top row. Thin lines show final performance to ease comparison. From these charts, we suggest that our unified training and tuning (Alg. \ref{alg:hyperparam_tuning}) is effective, as all methods show gains over time.
  • Figure 2: Performance variation across independent trials over time. Intervals visualize the lowest and highest balanced accuracy of 5 separate trials of Alg. \ref{alg:hyperparam_tuning}. Y-axis shows balanced accuracy. X-axis from left to right shows CoMatch, FlexMatch, FixMatch, MixMatch, Mean-Teacher, Pseudo-labeling, SwAV, MoCo, SimCLR, BYOL and SimSiam. More results in Appendix \ref{app:results}.
  • Figure B.1: Balanced accuracy on test set over time for semi- and self-supervised methods, with (left) and without (right) initial weight pretraining on ImageNet. Curves represent mean of each method at each time over 5 trials of Alg. \ref{alg:hyperparam_tuning}.
  • Figure B.2: Validation-set accuracy over time profiles of semi- and self-supervised methods on 4 datasets (panels a-d). All curves here by definition must be monotonically increasing. The increasing profiles here on the validation set translate to similar trends in test set performance in Fig. \ref{fig:test_performance_vs_time}, indicating successful generalization.
  • Figure B.3: Profiles of several clinically-relevant performance metrics over time on the AIROGS test set. Top row: ResNet-18. Bottom row: ResNet-50. Columns, left-to-right: Balanced Accuracy, AUROC, Partial AUROC focused on the 90% - 100% specificity regime, and sensitivity at 95% specificity. At each time, we report mean of each method over 5 trials of Alg. \ref{alg:hyperparam_tuning}.
  • ...and 1 more figures
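
The captions above report balanced accuracy and describe validation curves that "by definition" never decrease. The sketch below illustrates both ideas under stated assumptions and is not the authors' code: balanced accuracy is taken in its standard sense (the mean of per-class recall), and the monotone validation profile is assumed to come from tracking the best validation score seen so far. The function names `balanced_accuracy` and `best_so_far` and all numbers are hypothetical toy values, not results from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def best_so_far(values):
    """Running maximum: best validation score seen up to each checkpoint.

    Assumption: this is why a validation-over-time curve cannot decrease.
    """
    out, best = [], float("-inf")
    for v in values:
        best = max(best, v)
        out.append(best)
    return out


# Toy, made-up labels: 8 negatives, 2 positives; the predictor always says 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

# Toy, made-up per-checkpoint validation scores.
val_curve = best_so_far([0.61, 0.58, 0.70, 0.66, 0.74])

print(plain_acc, bal_acc, val_curve)
```

On this imbalanced toy example, plain accuracy rewards the trivial majority-class predictor while balanced accuracy does not, which is one common motivation for preferring it when classes are imbalanced.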