Transfusion: Understanding Transfer Learning for Medical Imaging

Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio

TL;DR

This work challenges the prevailing reliance on ImageNet-pretrained models for medical imaging by benchmarking standard ImageNet architectures against lightweight CNNs on two large medical tasks. Through representational analyses and weight-transfer experiments, it shows that transfer learning provides limited performance gains and that overparameterized models are often unnecessary. SVCCA reveals that pretrained and randomly initialized models differ mainly in how closely the lowest layers stay to their initialization, with substantial feature reuse confined to the bottom of the network. The authors further demonstrate feature-independent benefits from weight scaling and explore hybrid transfer strategies that maintain performance while enabling faster convergence and more efficient model exploration.

Abstract

Transfer learning from natural image datasets, particularly ImageNet, using standard large models and corresponding pretrained weights has become a de facto method for deep learning applications to medical imaging. However, there are fundamental differences in data sizes, features, and task specifications between natural image classification and the target medical tasks, and there is little understanding of the effects of transfer. In this paper, we explore properties of transfer learning for medical imaging. A performance evaluation on two large-scale medical imaging tasks shows that, surprisingly, transfer offers little benefit to performance, and simple, lightweight models can perform comparably to ImageNet architectures. Investigating the learned representations and features, we find that some of the differences from transfer learning are due to the over-parametrization of standard models rather than sophisticated feature reuse. We isolate where useful feature reuse occurs, and outline the implications for more efficient model exploration. We also explore feature-independent benefits of transfer arising from weight scalings.

Paper Structure

This paper contains 18 sections, 1 equation, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Example images from the ImageNet, retinal fundus photograph, and CheXpert datasets, respectively. The fundus photographs and chest x-rays have much higher resolution than the ImageNet images, and are classified by looking for small local variations in tissue.
  • Figure 2: Pretrained weights give rise to different hidden representations than training from random initialization for large models. We compute CCA similarity scores between representations learned using pretrained weights and those from random initialization. We do this for the top two layers (or stages for Resnet, Inception) and average the scores, plotting the results in orange. In blue is a baseline similarity score, for representations trained from different random initializations. We see that representations learned from random initialization are more similar to each other than those learned from pretrained weights for larger models, with less of a distinction for smaller models.
  • Figure 3: Per-layer CCA similarities before and after training on the medical task. For all models, we see that the lowest layers are most similar to their initializations, and this is especially evident for Resnet50 (a large model). We also see that feature reuse is mostly restricted to the bottom two layers (stages for Resnet), the only place where similarity with initialization is significantly higher for pretrained weights (grey dotted lines show the difference in similarity scores between pretrained and random initialization).
  • Figure 4: Large models move less through training at lower layers: similarity at initialization is highly correlated with similarity at convergence for large models. We plot the CCA similarity of Resnet (conv1) representations from random and pretrained initializations (i) at initialization against (ii) the CCA similarity of the converged representations (top row, second from left). We also do this for two different random initializations (top row, left). In both cases (even for random initialization), we see a surprising, strong correlation between similarity at initialization and similarity after convergence ($R^2 = 0.75, 0.84$). This is not the case for the smaller CBR-Small model, illustrating the overparametrization of Resnet for the task. Higher layers likely must change much more for good task performance.
  • Figure 5: Visualization of conv1 filters shows the remains of initialization after training in Resnet, and the absence and erasure of Gabor filters in CBR-Small. We visualize the filters before and after training from random initialization and pretrained weights for Resnet (top row) and CBR-Small (bottom row). Comparing the similarity of (e) to (f) and (g) to (h) shows the limited movement of Resnet through training, while CBR-Small changes much more. We see that CBR does not learn Gabor filters when trained from scratch (f), and also erases some of the pretrained Gabors (compare (g) to (h)).
  • ...and 14 more figures
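
The CCA similarity scores referenced in the captions of Figures 2 to 4 compare the activations of two networks (or of one network before and after training) on the same inputs. The following is a minimal NumPy sketch of one common way to compute such a score, in the SVCCA style: reduce each activation matrix to its top singular directions, then average the canonical correlations between the two subspaces. It is an illustration under stated assumptions, not the authors' implementation; the function name `svcca_similarity` and the `var_kept` threshold are hypothetical choices made here.

```python
import numpy as np


def svcca_similarity(acts_a, acts_b, var_kept=0.99):
    """Mean canonical correlation between two activation matrices.

    acts_a, acts_b: arrays of shape (n_examples, n_neurons), recorded on the
    same examples. The two layers may have different neuron counts.
    var_kept: fraction of variance retained in the SVD step (an assumption
    of this sketch, not a value taken from the paper).
    """

    def top_subspace(acts):
        # Center each neuron, then keep the leading left singular vectors
        # that together explain at least `var_kept` of the variance.
        centered = acts - acts.mean(axis=0, keepdims=True)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        energy = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = min(int(np.searchsorted(energy, var_kept)) + 1, len(s))
        return u[:, :k]  # orthonormal basis of the retained subspace

    basis_a = top_subspace(acts_a)
    basis_b = top_subspace(acts_b)

    # For orthonormal bases, the singular values of basis_a^T basis_b are
    # the canonical correlations between the two subspaces.
    rho = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())
```

Under this sketch, a score near 1 means the two layers span almost the same subspace on the evaluated examples, while unrelated representations score close to 0. A per-layer comparison like the one in Figure 3 would call the function once per layer, with activations taken at initialization and after training.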