MISSFormer: An Effective Medical Image Segmentation Transformer

Xiaohong Huang, Zhifang Deng, Dandan Li, Xueguang Yuan

TL;DR

MISSFormer addresses the limitation of CNNs in modeling long-range dependencies for medical image segmentation by introducing a hierarchical transformer with an Enhanced Transformer Block and an Enhanced Transformer Context Bridge to fuse multi-scale features. The approach achieves state-of-the-art performance on Synapse and ACDC, even when trained from scratch, demonstrating strong long-range and local-context modeling without reliance on ImageNet pre-training. Key contributions include the redesigned feed-forward network (Enhanced Mix-FFN), Efficient Self-Attention, and a multi-scale bridge that jointly capture global and local information. The results indicate the core designs generalize well to other visual segmentation tasks and offer robust, edge-aware segmentation across organs and cardiac structures.

Abstract

CNN-based methods have achieved impressive results in medical image segmentation, but they fail to capture long-range dependencies because of the inherent locality of the convolution operation. Transformer-based methods have recently become popular in vision tasks because of their capacity for long-range dependencies and their promising performance; however, they are weak at modeling local context. In this paper, taking medical image segmentation as an example, we present MISSFormer, an effective and powerful Medical Image Segmentation tranSFormer. MISSFormer is a hierarchical encoder-decoder network with two appealing designs: 1) The feed-forward network is redesigned in the proposed Enhanced Transformer Block, which enhances long-range dependencies and supplements local context, making the features more discriminative. 2) We propose the Enhanced Transformer Context Bridge: unlike previous methods that model only global information, the proposed context bridge, built on the enhanced transformer block, extracts both the long-range dependencies and the local context of the multi-scale features generated by our hierarchical transformer encoder. Driven by these two designs, MISSFormer shows a solid capacity to capture more discriminative dependencies and context in medical image segmentation. Experiments on multi-organ and cardiac segmentation tasks demonstrate the superiority, effectiveness, and robustness of MISSFormer; trained from scratch, it even outperforms state-of-the-art methods pre-trained on ImageNet. The core designs can be generalized to other visual segmentation tasks. The code has been released on GitHub: https://github.com/ZhifangDeng/MISSFormer
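
To make "supplementing local context" in a feed-forward network concrete, here is a minimal sketch of the Mix-FFN pattern that the Figure 2 caption attributes to SegFormer and PVTv2: a 3x3 depthwise convolution placed between the two linear layers, followed by a residual connection. It is not the authors' Enhanced Mix-FFN, whose exact layout is not described in this section. The function names, the row-major token layout, and the zero padding are assumptions made for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def linear(tokens, weight, bias):
    """Per-token linear layer. tokens: N x C_in, weight: C_in x C_out."""
    return [
        [sum(t[i] * weight[i][j] for i in range(len(t))) + bias[j]
         for j in range(len(bias))]
        for t in tokens
    ]


def depthwise_conv3x3(tokens, height, width, kernels):
    """3x3 depthwise convolution with zero padding.

    tokens: (height * width) x C in row-major order.
    kernels: C x 3 x 3, one kernel per channel (channels are not mixed).
    """
    channels = len(kernels)
    out = [[0.0] * channels for _ in range(height * width)]
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += tokens[yy * width + xx][c] * kernels[c][dy + 1][dx + 1]
                out[y * width + x][c] = acc
    return out


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def mix_ffn(tokens, height, width, w1, b1, kernels, w2, b2):
    """Mix-FFN-style block: linear -> depthwise conv -> GELU -> linear, plus residual."""
    hidden = linear(tokens, w1, b1)                              # expand channels
    local = depthwise_conv3x3(hidden, height, width, kernels)    # mix spatial neighbours
    activated = [[gelu(v) for v in row] for row in local]
    projected = linear(activated, w2, b2)                        # project back to C_in
    return [[p + t for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(projected, tokens)]
```

With a kernel that is zero everywhere except its center, the block reduces to an ordinary per-token FFN, so a change to one token affects only that token's output. With any other kernel, the change also reaches the outputs of the eight neighbouring positions, which is the local context a plain FFN lacks.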

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: The overall structure of the proposed MISSFormer. (a) The proposed MISSFormer framework. (b) The structure of Enhanced Transformer Block.
  • Figure 2: Various explorations of locality in the feed-forward network, from left to right: (a) Residual Block in LocalViT, (b) LeFF in Uformer, Mix-FFN in SegFormer and PVTv2, (c) proposed Simple Enhanced Mix-FFN, (d) proposed Enhanced Mix-FFN.
  • Figure 3: The Enhanced Transformer Context Bridge
  • Figure 4: The average L1 norm of the gradients with respect to the weights of the second fully connected layer in the FFN, for layers 0, 1, 3, 6, and 7.
  • Figure 5: The convergence and evaluation results of different methods.
  • ...and 1 more figures