Going Beyond U-Net: Assessing Vision Transformers for Semantic Segmentation in Microscopy Image Analysis

Illia Tsiporenko, Pavel Chizhov, Dmytro Fishman

TL;DR

A comparative analysis of transformer models for advancing biomedical image segmentation demonstrates that their efficiency and applicability can be improved with careful modifications, facilitating their future use in microscopy image analysis tools.

Abstract

Segmentation is a crucial step in microscopy image analysis. Numerous approaches have been developed over the past years, ranging from classical segmentation algorithms to advanced deep learning models. While U-Net remains one of the most popular and well-established models for biomedical segmentation tasks, recently developed transformer-based models promise to enhance the segmentation process of microscopy images. In this work, we assess the efficacy of transformers, including UNETR, the Segment Anything Model, and Swin-UPerNet, and compare them with the well-established U-Net model across various image modalities such as electron microscopy, brightfield, histopathology, and phase-contrast. Our evaluation identifies several limitations in the original Swin Transformer model, which we address through architectural modifications to optimise its performance. The results demonstrate that these modifications improve segmentation performance compared to the classical U-Net model and the unmodified Swin-UPerNet. This comparative analysis highlights the promise of transformer models for advancing biomedical image segmentation. It demonstrates that their efficiency and applicability can be improved with careful modifications, facilitating their future use in microscopy image analysis tools.

Paper Structure

This paper contains 26 sections, 1 equation, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Example crops of images from (a) the Electron Microscopy dataset, (b) the MoNuSeg dataset, (c) the Seven Cell Lines dataset, and (d) the LIVECell dataset.
  • Figure 2: The Swin-UPerNet architecture, which consists of the Swin Transformer (blue blocks) and the UPerNet decoder (green blocks). Orange dotted rectangles mark our proposed modifications to the model architecture. Conv denotes a convolutional block made of a convolutional layer, batch normalisation, and ReLU activation. Deconv denotes a transposed convolution. The circle with a line denotes an addition followed by a convolution with kernel size $3\times3$.
  • Figure 3: Predicted segmentation masks of Swin-S-TB-Skip, UNETR, U-Net, and the Segment Anything Model (SAM; using bounding box prompts, point prompts, and automatic segmentation). The white contour outlines the ground truth mask. The colour overlay shows the mask predicted by each model: green for Swin-S-TB-Skip, red for UNETR, blue for U-Net, and purple for SAM. The image from the MoNuSeg dataset was converted to grayscale so that the predicted masks are easier to see.
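
The two building blocks named in the Figure 2 caption (the Conv block, and the addition followed by a $3\times3$ convolution) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. It assumes a single-channel feature map stored as a list of lists, applies the kernel as a cross-correlation with zero padding (as deep learning frameworks do), and replaces batch normalisation with standardisation over one map, without learnable scale or shift. All function names (`conv2d_same`, `normalise`, `conv_block`, `fuse`) are hypothetical, and the transposed convolution (Deconv) is not covered.

```python
import math


def conv2d_same(fmap, kernel):
    """Apply a square, odd-sized kernel with zero padding (output size = input size)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    # Positions outside the map count as zero (zero padding).
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += fmap[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def normalise(fmap, eps=1e-5):
    """Stand-in for batch normalisation: zero mean, unit variance over one map."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in fmap]


def relu(fmap):
    """Set negative activations to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def conv_block(fmap, kernel):
    """Conv block from the caption: convolution, then normalisation, then ReLU."""
    return relu(normalise(conv2d_same(fmap, kernel)))


def fuse(a, b, kernel):
    """Circle-with-a-line operation: element-wise addition, then a 3x3 convolution."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must have the same spatial size")
    summed = [[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]
    return conv2d_same(summed, kernel)


# Toy inputs: an identity kernel leaves a map unchanged under conv2d_same.
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
fmap = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

activated = conv_block(fmap, IDENTITY)
fused = fuse(fmap, fmap, IDENTITY)
```

In a real model each block would operate on multi-channel tensors with learned kernels. The sketch only shows the order of operations and the requirement that the two maps being added share the same spatial size.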