A Study on Self-Supervised Pretraining for Vision Problems in Gastrointestinal Endoscopy

Edward Sanderson, Bogdan J. Matuszewski

TL;DR

This study evaluates self-supervised learning (SSL) pretraining of backbones for vision tasks in gastrointestinal endoscopy (GIE). It compares SSL algorithms (MoCo v3, Barlow Twins, MAE) combined with pretraining data sources (ImageNet-1k, Hyperkvasir-unlabelled) against supervised ImageNet-1k pretraining and random initialisation. ResNet50 and ViT-B backbones are evaluated across five GIE tasks (anatomical landmark recognition, pathological finding characterisation, polyp detection, polyp segmentation, and monocular depth estimation in colonoscopy) using state-of-the-art decoders and standardised metrics. The results indicate that SSL pretraining generally yields better downstream performance than supervised pretraining, and that SSL with ImageNet-1k typically outperforms SSL with Hyperkvasir-unlabelled, except in monocular depth estimation, where similarity between pretraining and downstream data appears to matter more. The findings also reveal architecture-task dependencies (ViT-B is stronger in segmentation and depth estimation, ResNet50 in detection) and suggest three general pretraining principles to guide GIE-specific SSL development, motivating further domain-focused SSL research and dataset creation.

Abstract

Solutions to vision tasks in gastrointestinal endoscopy (GIE) conventionally use image encoders pretrained in a supervised manner with ImageNet-1k as backbones. However, the use of modern self-supervised pretraining algorithms and a recent dataset of 100k unlabelled GIE images (Hyperkvasir-unlabelled) may allow for improvements. In this work, we study the fine-tuned performance of models with ResNet50 and ViT-B backbones pretrained in self-supervised and supervised manners with ImageNet-1k and Hyperkvasir-unlabelled (self-supervised only) in a range of GIE vision tasks. In addition to identifying the most suitable pretraining pipeline and backbone architecture for each task, out of those considered, our results suggest three general principles. First, self-supervised pretraining generally produces more suitable backbones for GIE vision tasks than supervised pretraining. Second, self-supervised pretraining with ImageNet-1k is typically more suitable than pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy. Third, ViT-Bs are more suitable in polyp segmentation and monocular depth estimation in colonoscopy, ResNet50s are more suitable in polyp detection, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation. We hope this work draws attention to the complexity of pretraining for GIE vision tasks, informs the development of more suitable approaches than the convention, and inspires further research on this topic to help advance this development. Code available: github.com/ESandML/SSL4GIE

Paper Structure (30 sections, 30 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Targets (yellow bounding boxes) and predictions (green bounding boxes) for two randomly selected instances of the Kvasir-SEG test set. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.
  • Figure 2: Targets and predictions for two randomly selected instances of the Kvasir-SEG test set. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.
  • Figure 3: Targets and post-processed predictions for a randomly selected instance from each of the test videos for C3VD. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.
  • Figure 4: Error maps for the post-processed predictions shown in Fig. 3, illustrating the absolute error with a larger value represented by a darker shade. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.
  • Figure 5: Ranking of the performance of each model on each task, as measured by mF1 (anatomical landmark recognition and pathological finding characterisation), AP (polyp detection), mDice (polyp segmentation), and mRMSE (monocular depth estimation in colonoscopy), where a better rank is represented by a greater distance from the centre. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.
  • ...and 6 more figures
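
The per-task ranking described in the Figure 5 caption depends on the direction of each metric: mF1, AP, and mDice improve as they increase, whereas mRMSE is an error and improves as it decreases. The following is a minimal sketch of such a ranking, not the authors' implementation. The function name `rank_models`, the hyphenated model labels (assembled from the caption's abbreviations), and all score values are assumptions made up for illustration; none of the numbers are results from the paper.

```python
# Hypothetical sketch: rank models on a task while respecting metric direction.
# Metric names follow the Figure 5 caption. Every score below is a placeholder
# invented for illustration, NOT a result reported in the paper.

# True means a higher value is better; mRMSE is an error, so lower is better.
HIGHER_IS_BETTER = {"mF1": True, "AP": True, "mDice": True, "mRMSE": False}


def rank_models(scores, metric):
    """Return {model: rank}, where rank 1 is the best model under `metric`.

    Ties are broken by input order; a real analysis may want shared ranks.
    """
    ordered = sorted(scores, key=scores.get, reverse=HIGHER_IS_BETTER[metric])
    return {model: position for position, model in enumerate(ordered, start=1)}


# Illustrative labels (assumed format: architecture-data-algorithm).
dice_scores = {"RN-IN-SL": 0.80, "VT-IN-MA": 0.90, "VT-NA-NA": 0.60}
rmse_scores = {"RN-IN-SL": 0.30, "VT-IN-MA": 0.20, "VT-NA-NA": 0.50}

dice_ranks = rank_models(dice_scores, "mDice")  # highest mDice gets rank 1
rmse_ranks = rank_models(rmse_scores, "mRMSE")  # lowest mRMSE gets rank 1
print(dice_ranks)
print(rmse_ranks)
```

Keeping the metric direction in a single lookup table means the same ranking routine can be reused for every task, which is what allows different metrics to be shown together in one rank-based plot.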