MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation

De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zhi-Chao Lai, Zeng-Guang Hou

TL;DR

MOSformer tackles inter-slice information fusion in 2.5D medical image segmentation by using dual encoders with a momentum update and a multi-scale inter-slice fusion transformer (IF-Trans). This design yields distinguishable yet consistent slice features and effective cross-slice context modeling, achieving state-of-the-art results on Synapse, ACDC, and AMOS. The approach offers favorable model complexity and robustness on anisotropic data, highlighting the practicality of momentum-encoder-based inter-slice fusion for 3D segmentation. It paves the way for applying inter-slice fusion strategies to other clinical image analysis tasks.

Abstract

Medical image segmentation plays an important role in various clinical applications. 2.5D-based segmentation models combine the computational efficiency of 2D-based models with the spatial perception capabilities of 3D-based models. However, existing 2.5D-based models primarily adopt a single encoder to extract features of both target and neighborhood slices, which fails to fuse inter-slice information effectively and results in suboptimal segmentation performance. In this study, a novel momentum encoder-based inter-slice fusion transformer (MOSformer) is proposed to overcome this issue by leveraging inter-slice information from multi-scale feature maps extracted by different encoders. Specifically, dual encoders are employed to enhance feature distinguishability among different slices. One of the encoders is updated as a moving average (momentum update) to maintain consistent slice representations. Moreover, an inter-slice fusion transformer (IF-Trans) module is developed to fuse inter-slice multi-scale features. MOSformer is evaluated on three benchmark datasets (Synapse, ACDC, and AMOS), achieving new state-of-the-art DSC scores of 85.63%, 92.19%, and 85.43%, respectively. These results demonstrate MOSformer's competitiveness in medical image segmentation.
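
The moving-averaged encoder mentioned in the abstract can be illustrated with a minimal sketch of a momentum (exponential moving average) update. This is not the authors' implementation: the function name `momentum_update`, the dictionary-of-floats parameter layout, and the default coefficient `m=0.9` are assumptions chosen for illustration, and the update rule shown is the commonly used form in which the momentum encoder slowly tracks the other encoder.

```python
def momentum_update(online_params, momentum_params, m=0.9):
    """Return EMA-updated momentum parameters.

    Hypothetical helper: parameters are plain floats keyed by name.
    Rule (assumed): theta_mom <- m * theta_mom + (1 - m) * theta_online
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("momentum coefficient m must lie in [0, 1]")
    return {
        name: m * momentum_params[name] + (1.0 - m) * online_params[name]
        for name in momentum_params
    }


# Usage: the momentum weights drift toward the online weights step by step.
online = {"w": 1.0}
momentum = {"w": 0.0}
for _ in range(2):
    momentum = momentum_update(online, momentum, m=0.5)
```

With a coefficient close to 1, the momentum encoder changes slowly, which is one plausible reading of how it keeps slice representations consistent while still differing from the other encoder.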

Paper Structure (18 sections, 7 equations, 9 figures, 8 tables)


Figures (9)

  • Figure 1: Comparison between the conventional feature extraction paradigm of 2.5D-based segmentation models and our proposed paradigm. (a) Conventional approaches use a single encoder to extract features of all slices. Therefore, target slices and neighborhood slices share the same feature space. (b) Our proposed paradigm adopts dual encoders to extract features of target and neighborhood slices, respectively. A momentum update is used in the neighborhood slice encoder. Hence, the feature spaces of target and neighborhood slices are distinguishable yet consistent. (Mom.: Momentum.)
  • Figure 2: The architecture of MOSformer. It comprises dual encoders: a momentum encoder that extracts features of the neighborhood slice and an encoder that extracts features from the target slice. IF-Trans is designed to perform inter-slice fusion independently at different scales. The fused features are then fed into a CNN decoder to produce segmentation maps for the target slices. Cylinders in yellow, pink, blue, green, and purple denote feature maps produced by the momentum encoder, the encoder, the IF-Trans, the upsampling operators, and the decoder blocks, i.e., (Conv+BN+ReLU) × 2, respectively.
  • Figure 3: Schematic of the inter-slice fusion transformer (IF-Trans) module. The neighborhood slice number is set to $1$ in this figure, consistent with our default model configuration. The module consists of two successive IF-Trans blocks with different window partitioning configurations. The colored circles indicate feature pixels. The window-based self-attention is expanded to the inter-slice dimension, allowing target slice feature pixels to learn intra- and inter-slice contexts. The black and red arrows denote the data flow within the IF-Trans module and the fusion process of the IF-Trans module, respectively.
  • Figure 4: Visual comparisons with some representative methods on the multi-organ segmentation (Synapse) dataset.
  • Figure 5: Visual comparisons with some representative methods on the automatic cardiac diagnosis challenge (ACDC) dataset.
  • ...and 4 more figures
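
The Figure 3 caption describes window-based self-attention expanded to the inter-slice dimension. A rough sketch of the grouping step this implies is given below: tokens from every slice that fall inside the same spatial window are gathered into one group, so that attention computed within a group would mix intra- and inter-slice context. The function `inter_slice_windows`, the nested-list feature maps, and the non-overlapping row-major partitioning are assumptions made for illustration. The attention computation itself and the second window partitioning configuration are omitted.

```python
def inter_slice_windows(slices, window):
    """Group tokens from all slices that share a spatial window.

    slices: list of S feature maps, each an H x W nested list of tokens.
    Returns windows in row-major order; each window holds
    S * window * window tokens, slice by slice.
    """
    height, width = len(slices[0]), len(slices[0][0])
    if height % window or width % window:
        raise ValueError("H and W must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tokens = [
                fmap[r][c]
                for fmap in slices
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            windows.append(tokens)
    return windows


# Usage: one target map and one neighborhood map, each 4 x 4, window size 2.
target = [[("t", r, c) for c in range(4)] for r in range(4)]
neighbor = [[("n", r, c) for c in range(4)] for r in range(4)]
groups = inter_slice_windows([target, neighbor], window=2)
```

Each group here contains the 2 × 2 target tokens followed by the 2 × 2 neighborhood tokens of the same window, which is the set over which a window attention layer would operate in this simplified reading.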