A BERT-Style Self-Supervised Learning CNN for Disease Identification from Retinal Images

Xin Li, Wenhui Zhu, Peijie Qiu, Oana M. Dumitrascu, Amal Youssef, Yalin Wang

TL;DR

The paper addresses label scarcity and high data requirements in medical image analysis by introducing a BERT-style self-supervised pre-training framework built on the lightweight CNN nn-MobileNet. It combines masking-based self-supervision with sparse convolution and a UNet decoder to learn hierarchical, localized representations from unlabeled fundus images. The framework uses a five-level downsampling hierarchy and a masked-autoencoder-like objective. It is evaluated on Alzheimer's and Parkinson's disease identification and on the MICCAI MMAC datasets, where it achieves state-of-the-art or competitive results while using substantially fewer unlabeled images than some ViT baselines. The approach improves efficiency and data utilization, and its heatmaps align with known retinal biomarkers, suggesting practical value for disease screening when labeled data are limited.

Abstract

In the field of medical imaging, the advent of deep learning, especially the application of convolutional neural networks (CNNs), has revolutionized the analysis and interpretation of medical images. Nevertheless, deep learning methods usually rely on large amounts of labeled data. In medical imaging research, the acquisition of high-quality labels is both expensive and difficult. The introduction of Vision Transformers (ViT) and self-supervised learning provides a pre-training strategy that utilizes abundant unlabeled data, effectively alleviating the label acquisition challenge while broadening the breadth of data utilization. However, ViT's high computational density and substantial demand for computing power, coupled with the lack of localization characteristics of its operations on image patches, limit its efficiency and applicability in many application scenarios. In this study, we employ nn-MobileNet, a lightweight CNN framework, to implement a BERT-style self-supervised learning approach. We pre-train the network on unlabeled retinal fundus images from the UK Biobank to improve downstream application performance. We validate the pre-trained model on the identification of Alzheimer's disease (AD), Parkinson's disease (PD), and various retinal diseases. The results show that our approach can significantly improve performance in the downstream tasks. In summary, this study combines the benefits of CNNs with the capabilities of advanced self-supervised learning in handling large-scale unlabeled data, demonstrating the potential of CNNs in the presence of label scarcity.

Paper Structure

This paper contains 6 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The ViT on the left panel can process non-masked patches without any changes since it can process variable-length sequences, while the CNN on the right panel cannot skip masks for convolutions. Simply adopting masking in CNN may lead to performance degradation. This work adopts a novel solution to address this problem.
  • Figure 2: The detailed architecture of the nn-MobileNet and its ILRB design. The nn-MobileNet achieved superior results in retinal imaging research [zhunnmobilenet]. The current self-supervised learning scheme further enhances its performance.
  • Figure 3: Illustration of our pre-training workflow. We start by masking all the images randomly (Step 1). Next, we mask the feature maps adapted to different resolutions for the CNN encoder and decoder (Step 2). Finally, we perform sparse convolution on the masked image and restore the image through the decoder (Step 3).
  • Figure 4: Illustration of the quality control pipeline as we filter fundus image data of AD patients, PD patients, and normal control subjects from the UK Biobank dataset.
  • Figure 5: Heat maps for AD (first row) and PD (second row) show that our model highlights biomarkers consistent with prior research [Dumitrascu:Cells2021].
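
The three steps in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a 224×224 input, a stride of 2 at each of the five downsampling levels, and a mask ratio of 0.6, none of which are stated in the text above. The function names are hypothetical. Sparse convolution is emulated in one common way, by zeroing masked positions before and after a dense convolution, so that masked regions do not leak into visible ones (the problem pointed out in the Figure 1 caption).

```python
import numpy as np


def random_patch_mask(grid, mask_ratio, rng):
    """Step 1: return a (grid, grid) boolean array; True marks a visible patch."""
    n_total = grid * grid
    n_keep = int(round(n_total * (1.0 - mask_ratio)))
    keep = np.zeros(n_total, dtype=bool)
    keep[rng.choice(n_total, size=n_keep, replace=False)] = True
    return keep.reshape(grid, grid)


def mask_pyramid(coarse_mask, num_levels):
    """Step 2: upsample the coarsest mask to every resolution in the hierarchy.

    Returns num_levels + 1 masks, finest first: index 0 matches the input
    image and the last index is the coarsest mask itself. A stride of 2 per
    level is an assumption of this sketch.
    """
    masks = []
    for k in range(num_levels, -1, -1):
        factor = 2 ** k
        masks.append(coarse_mask.repeat(factor, axis=0).repeat(factor, axis=1))
    return masks


def masked_conv3x3(x, kernel, mask):
    """Step 3 (emulated): 3x3 cross-correlation that ignores masked positions.

    Masked inputs are zeroed first, and the output is masked again so that
    hidden regions stay empty after the operation.
    """
    x = x * mask
    padded = np.pad(x, 1)  # zero padding keeps the spatial size unchanged
    out = np.zeros_like(x)
    height, width = x.shape
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out * mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coarse = random_patch_mask(grid=7, mask_ratio=0.6, rng=rng)  # 224 / 2**5 = 7
    masks = mask_pyramid(coarse, num_levels=5)
    print([m.shape for m in masks])

    feature_map = rng.normal(size=masks[4].shape)  # a 14x14 single-channel map
    out = masked_conv3x3(feature_map, np.full((3, 3), 1.0 / 9.0), masks[4])
    print("masked positions stay zero:", bool(np.all(out[~masks[4]] == 0)))
```

Because every mask is derived from the same coarse mask, the visible fraction is identical at all resolutions, which is what lets the encoder and the decoder agree on which regions must be reconstructed.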