TBConvL-Net: A Hybrid Deep Learning Architecture for Robust Medical Image Segmentation

Shahzaib Iqbal, Tariq M. Khan, Syed S. Naqvi, Asim Naveed, Erik Meijering

TL;DR

TBConvL-Net is a hybrid deep learning architecture for medical image segmentation that combines the local features of a CNN encoder-decoder with the long-range and temporal dependencies captured by biconvolutional long-short-term memory networks and vision transformers.

Abstract

Deep learning has shown great potential for automated medical image segmentation to improve the precision and speed of disease diagnostics. However, the task presents significant difficulties due to variations in the scale, shape, texture, and contrast of the pathologies. Traditional convolutional neural network (CNN) models have certain limitations when it comes to effectively modelling multiscale context information and facilitating information interaction between skip connections across levels. To overcome these limitations, a novel deep learning architecture is introduced for medical image segmentation, taking advantage of CNNs and vision transformers. Our proposed model, named TBConvL-Net, involves a hybrid network that combines the local features of a CNN encoder-decoder architecture with long-range and temporal dependencies using biconvolutional long-short-term memory (LSTM) networks and vision transformers (ViT). This enables the model to capture contextual channel relationships in the data and account for the uncertainty of segmentation over time. Additionally, we introduce a novel composite loss function that considers both the segmentation robustness and the boundary agreement of the predicted output with the gold standard. Our proposed model shows consistent improvement over the state of the art on ten publicly available datasets of seven different medical imaging modalities.
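
The abstract does not give the exact form of the composite loss, so the following is only a rough sketch of the general idea of pairing a region-overlap term with a boundary-agreement term. The function names (`soft_dice_loss`, `boundary_map`, `composite_loss`), the 4-neighbour boundary definition, and the weight `alpha` are all assumptions made for illustration and are not taken from the paper. The boundary map is thresholded, so this version shows the quantity being measured and is not a differentiable training loss.

```python
from typing import List

Mask = List[List[float]]  # 2D mask with values in [0, 1]


def soft_dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Region term: 1 - Dice overlap between two masks."""
    inter = sum(p * t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(sum(r) for r in pred) + sum(sum(r) for r in target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def boundary_map(mask: Mask) -> Mask:
    """Mark foreground pixels that touch background (or the image border)
    through one of their 4 neighbours. Hypothetical boundary definition."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] < 0.5:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or mask[ny][nx] < 0.5:
                    out[y][x] = 1.0
                    break
    return out


def composite_loss(pred: Mask, target: Mask, alpha: float = 0.5) -> float:
    """Weighted sum of a region term and a boundary term (assumed weighting)."""
    region = soft_dice_loss(pred, target)
    edge = soft_dice_loss(boundary_map(pred), boundary_map(target))
    return alpha * region + (1.0 - alpha) * edge


# Toy 4x4 example: a 2x2 lesion and a prediction shifted one pixel right.
target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
empty = [[0] * 4 for _ in range(4)]

loss_perfect = composite_loss(target, target)
loss_shifted = composite_loss(shifted, target)
loss_empty = composite_loss(empty, target)
```

In this toy case the loss increases from the perfect prediction, through the shifted one, to the empty one, which is the ordering a segmentation loss should produce.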

Paper Structure

This paper contains 14 sections, 29 equations, 11 figures, and 12 tables.

Figures (11)

  • Figure 1: Block diagram of the TBConvL-Net architecture, showing its key components: encoder, decoder, and skip connections with BConvLSTM and Transformer layers.
  • Figure 2: Design of the ConvLSTM block, a solution to the spatial correlation shortcomings of traditional LSTM models, achieved by the incorporation of convolutional operations in the input-to-state and state-to-state transitions. The architecture includes a memory cell ($M_c$), an output gate ($\phi$), an input gate ($i$) and a forget gate ($f$), with these gates serving as control mechanisms to access, update and erase the content of the memory cells. For both the hidden and the input states in the block, 2D convolution masks are used, with Hadamard and convolutional operations symbolised by $\otimes$ and $\circledast$, respectively. The input and hidden state tensors are indicated by $\Im_t$ and $\wp_t$, respectively, while the biases associated with the memory cell, the output gate, the input gate and the forget gate are denoted as $\beta_{M_c}$, $\beta_{\phi}$, $\beta_i$, and $\beta_f$, respectively.
  • Figure 3: Lightweight Swin Transformer architecture. The input RGB images are divided into non-overlapping patches, transformed into tokens, and projected into an arbitrary dimension ($d$). Transformer blocks with modified self-attention computations process these tokens, creating a hierarchical representation. The lightweight version replaces the conventional multihead self-attention (MSA) module with a shifted window-based MSA module to reduce computational complexity while preserving core functionality. Efficiency is further improved by computing self-attention within local windows of a fixed size $N$, so that the computation scales linearly with image size.
  • Figure 4: Visual results of the proposed TBConvL-Net using different loss functions for thyroid nodule segmentation on the DDTI dataset.
  • Figure 5: Example segmentation results of TBConvL-Net on the ISIC 2017 skin lesion dataset. From left to right, the columns show the input images, the ground-truth masks, the segmentation results of TBConvL-Net, and the results of ARU-GD \cite{maji2022attention}, UNet++ \cite{zhou2018unet++}, U-Net \cite{ronneberger2015u}, BCDU-Net \cite{azad2019bi}, and Swin-Unet \cite{cao2023swin}, respectively. True-positive pixels are depicted in green, false-positive pixels in red, and false-negative pixels in blue.
  • ...and 6 more figures
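
The Figure 2 caption names the gates and operators of the ConvLSTM block but not the update rule itself. As a reading aid, the commonly used ConvLSTM update can be written in the caption's notation as follows. This is a sketch under assumptions: the weight kernels $W_{\cdot\cdot}$, the time subscripts (for example $M_{c,t}$ for the memory cell at step $t$), and the omission of peephole terms are introduced here for illustration, and the paper's own equations may differ.

$$
\begin{aligned}
i_t &= \sigma\left(W_{\Im i} \circledast \Im_t + W_{\wp i} \circledast \wp_{t-1} + \beta_i\right) \\
f_t &= \sigma\left(W_{\Im f} \circledast \Im_t + W_{\wp f} \circledast \wp_{t-1} + \beta_f\right) \\
\phi_t &= \sigma\left(W_{\Im \phi} \circledast \Im_t + W_{\wp \phi} \circledast \wp_{t-1} + \beta_{\phi}\right) \\
M_{c,t} &= f_t \otimes M_{c,t-1} + i_t \otimes \tanh\left(W_{\Im M_c} \circledast \Im_t + W_{\wp M_c} \circledast \wp_{t-1} + \beta_{M_c}\right) \\
\wp_t &= \phi_t \otimes \tanh\left(M_{c,t}\right)
\end{aligned}
$$

Here $\sigma$ is the logistic sigmoid. The forget gate $f_t$ controls how much of the previous memory cell is kept, the input gate $i_t$ controls how much new content is written, and the output gate $\phi_t$ controls how much of the updated cell is exposed as the hidden state $\wp_t$. Because every input-to-state and state-to-state transition is a convolution ($\circledast$), the spatial layout of the feature maps is preserved across time steps.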