Improving 3D Medical Image Segmentation at Boundary Regions using Local Self-attention and Global Volume Mixing

Daniya Najiha Abdul Kareem, Mustansar Fiaz, Noa Novershtern, Jacob Hanna, Hisham Cholakkal

TL;DR

A novel hierarchical encoder-decoder framework for volumetric 3D medical image segmentation that explicitly captures both local and global dependencies: local volume-based self-attention encodes local dependencies at high resolution, and a novel volumetric multi-layer perceptron (MLP)-mixer captures global dependencies at low-resolution feature representations.

Abstract

Volumetric medical image segmentation is a fundamental problem in medical image analysis where the objective is to accurately classify a given 3D volumetric medical image with voxel-level precision. In this work, we propose a novel hierarchical encoder-decoder-based framework that strives to explicitly capture the local and global dependencies for volumetric 3D medical image segmentation. The proposed framework exploits local volume-based self-attention to encode the local dependencies at high resolution and introduces a novel volumetric MLP-mixer to capture the global dependencies at low-resolution feature representations. The proposed volumetric MLP-mixer learns better associations among volumetric feature representations. These explicit local and global feature representations contribute to better learning of the shape-boundary characteristics of the organs. Extensive experiments on three different datasets reveal that the proposed method achieves favorable performance compared to state-of-the-art approaches. On the challenging Synapse Multi-organ dataset, the proposed method achieves an absolute gain of 3.82% over the state-of-the-art approaches in terms of the HD95 evaluation metric, while a similar improvement pattern is exhibited on the MSD Liver and Pancreas tumor datasets. We also provide a detailed comparison between recent architectural design choices in the 2D computer vision literature by adapting them for the problem of 3D medical image segmentation. Finally, our experiments on the ZebraFish 3D cell membrane dataset, which has limited training data, demonstrate the superior transfer learning capabilities of the proposed vMixer model on the challenging 3D cell instance segmentation task, where accurate boundary prediction plays a vital role in distinguishing individual cell instances.
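Since the headline result is reported in HD95, a small illustration of that metric may help. The sketch below computes a 95th-percentile Hausdorff distance between two sets of boundary points. The function name `hd95`, the use of raw point coordinates in place of extracted organ surfaces, and the convention of taking the larger of the two directional 95th percentiles are all assumptions made for illustration; published toolkits differ in these details, and this is not necessarily the evaluation code used in the paper.

```python
import numpy as np


def hd95(points_a, points_b):
    """95th-percentile symmetric Hausdorff distance between two point sets.

    Each input is an (N, 3) array-like of boundary coordinates. The result is
    in the same units as the coordinates (voxels, or mm if scaled by spacing).
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dist.min(axis=1)  # nearest distance from each point of a to b
    b_to_a = dist.min(axis=0)  # nearest distance from each point of b to a
    # Assumed convention: larger of the two directional 95th percentiles.
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))


if __name__ == "__main__":
    predicted = [(i, 0, 0) for i in range(100)]
    # Same boundary plus one stray point far from it.
    reference = predicted + [(0, 50, 0)]
    print(hd95(predicted, reference))
```

Taking a percentile rather than the maximum makes the score less sensitive to a handful of stray boundary points, which is why it is commonly preferred for judging boundary quality.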

Paper Structure

This paper contains 19 sections, 3 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: (a) Overview of the proposed vMixer framework with a hierarchical encoder-decoder architecture. The focus of our design is to explicitly capture the local and global feature dependencies for accurate segmentation. Our framework takes 3D images as input and employs a local volume self-attention (LVSA) block to explicitly learn the local dependencies at high resolution ($E1$, $D3$). The $E1$ features are downsampled and passed to the proposed global volume mixer (GVM) block to explicitly learn the global dependencies. In the decoder, the features are first upsampled and then fused with the encoder features through a skip connection. We employ GVM blocks at the first two decoder stages ($D1$ and $D2$) and an LVSA block at the last stage of the decoder ($D3$). The final decoder features are fed to an expanding layer to produce the final segmentation mask. (b) The LVSA block, which comprises a local volume-based multi-head self-attention (LV-MSA) layer followed by a shifted local volume-based multi-head self-attention (SLV-MSA) layer. (c) The structure of the volumetric MLP-mixer layer used in the GVM block. Each GVM block comprises two MLP-mixer layers. The volumetric MLP-mixer layer performs token-mixing and channel-mixing operations on the input volumetric tokens.
  • Figure 2: Qualitative comparison on the Synapse multi-organ dataset. Our method provides improved segmentation by accurately detecting the boundaries of the organ.
  • Figure 3: Qualitative comparison of ablation experiments on the Synapse multi-organ dataset. On closer inspection, our method (LVSA at stage 1 and GVM at stages 2-4) performs better than (i) GVM (stages 1-4) and (ii) LVSA (stages 1-4). Row 1 corresponds to a cross-section of the left kidney, and row 2 shows a cross-section containing portions of the liver (light pink), spleen (magenta), and stomach (red). Our method preserves organ shape better than the other settings.
  • Figure 4: Qualitative results on the MSD Pancreas tumour (left) and MSD Liver tumour (right) datasets. Our vMixer provides accurate segmentation of boundary regions.
  • Figure 5: Adaptation of different network architecture blocks for 3D medical image segmentation. The depth-wise convolution (DWC) and depth-wise scaling (DCS) based (a) ConvNeXt \cite{convnext}, (b) FocalNet \cite{f_net} (see Fig. \ref{fig:q2}-b), and (d) DynaMixer \cite{dynamixer} (see Fig. \ref{fig:q2}-a) blocks operate on an input 3D volume of size $B \times C \times H \times W \times D$. The (c) Swin Transformer \cite{liu2021swin}, (e) Swin Mixer \cite{liu2021swin}, and (f) GVM blocks reshape the 3D volume to $B \times HWD \times C$ dimensional features. Here, $B$ denotes batch, $C$ channel, $H$ height, $W$ width, and $D$ depth.
  • ...and 3 more figures
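
The captions for Figures 1(c) and 5 describe reshaping a $B \times C \times H \times W \times D$ volume into $B \times HWD \times C$ tokens and then applying token mixing and channel mixing, with two mixer layers per GVM block. The following NumPy sketch illustrates that data flow only. The layer normalisation, GELU activation, residual connections, hidden sizes, and all function and parameter names (`volume_to_tokens`, `mixer_layer`, `gvm_block`, `tok_w1`, and so on) are assumptions borrowed from the standard MLP-Mixer design, not details confirmed by the captions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalise over the last axis (no learned scale or shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, w2):
    # Two-layer perceptron applied along the last axis of x.
    return gelu(x @ w1) @ w2


def volume_to_tokens(vol):
    # (B, C, H, W, D) -> (B, H*W*D, C): one token per voxel position.
    b, c, h, w, d = vol.shape
    return vol.reshape(b, c, h * w * d).transpose(0, 2, 1)


def mixer_layer(tokens, p):
    # Token mixing: the MLP acts across the N = H*W*D axis, shared over channels.
    y = layer_norm(tokens).transpose(0, 2, 1)  # (B, C, N)
    tokens = tokens + mlp(y, p["tok_w1"], p["tok_w2"]).transpose(0, 2, 1)
    # Channel mixing: the MLP acts across the C axis, shared over tokens.
    return tokens + mlp(layer_norm(tokens), p["ch_w1"], p["ch_w2"])


def gvm_block(tokens, layer_params):
    # The caption states that each GVM block comprises two mixer layers.
    for p in layer_params:
        tokens = mixer_layer(tokens, p)
    return tokens


def init_params(n_tokens, channels, hidden_tok, hidden_ch, rng):
    # Hypothetical small random initialisation, for illustration only.
    s = 0.02
    return {
        "tok_w1": s * rng.standard_normal((n_tokens, hidden_tok)),
        "tok_w2": s * rng.standard_normal((hidden_tok, n_tokens)),
        "ch_w1": s * rng.standard_normal((channels, hidden_ch)),
        "ch_w2": s * rng.standard_normal((hidden_ch, channels)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vol = rng.standard_normal((2, 8, 4, 4, 4))  # B=2, C=8, H=W=D=4
    tokens = volume_to_tokens(vol)  # shape (2, 64, 8)
    params = [init_params(64, 8, 32, 16, rng) for _ in range(2)]
    print(gvm_block(tokens, params).shape)
```

Because the token-mixing weights have one row per voxel position, their size grows with $HWD$, which is consistent with applying global mixing only at the low-resolution stages as the Figure 1 caption describes.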