Multimodal Information Interaction for Medical Image Segmentation

Xinxin Fan, Lin Liu, Haoran Zhang

TL;DR

MicFormer tackles multimodal medical image segmentation by introducing a dual-stream transformer backbone that explicitly fuses and communicates information across modalities. It combines a U-shaped parallel feature network with Swin Transformer blocks and a deformable Cross Transformer, leveraging cross attention with a deformable sampling mechanism to align and refine features from different modalities. The approach achieves state-of-the-art Dice and MIoU on the MM-WHS CT–MRI whole-heart segmentation task, while noting a slight HD95 trade-off attributed to inductive biases in competing architectures. Overall, MicFormer demonstrates the effectiveness of deformable cross-attention for cross-modal feature integration and suggests strong potential for broader multimodal imaging tasks.
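To make the cross-modal querying idea concrete, here is a minimal sketch of plain scaled dot-product cross attention, where tokens from one modality act as queries and tokens from the other supply keys and values. It is an illustration under stated assumptions, not MicFormer's implementation: the function name `cross_attention` and the toy token values are hypothetical, and the learned projections, multiple heads, windowing, and deformable sampling are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries_a, keys_b, values_b):
    """Scaled dot-product cross attention (hypothetical helper).

    queries_a: tokens from modality A, each a list of d floats.
    keys_b, values_b: tokens from modality B (same number of tokens).
    Returns one output vector per query: a weighted average of the
    modality-B values, weighted by query/key similarity.
    """
    d = len(queries_a[0])
    outputs = []
    for q in queries_a:
        # Similarity of this modality-A query to every modality-B key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_b]
        weights = softmax(scores)
        # Weighted sum of modality-B values, channel by channel.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values_b))
            for c in range(len(values_b[0]))
        ])
    return outputs


# Toy example: two "CT" query tokens attend over three "MRI" tokens.
ct_tokens = [[1.0, 0.0], [0.0, 1.0]]
mri_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(ct_tokens, mri_tokens, mri_tokens)
print(fused)
```

Each output row is a convex combination of the other modality's value vectors, so a query only receives information from positions whose keys it matches well. In a dual-stream design the same operation can be run in both directions, with each modality querying the other.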

Abstract

The use of multimodal data in assisted diagnosis and segmentation has emerged as a prominent area of current research. However, a primary challenge is how to fuse multimodal features effectively. Most current approaches focus on integrating multimodal features while ignoring the correlation and consistency between features of different modalities, which can introduce irrelevant information. To address this issue, we introduce the Multimodal Information Cross Transformer (MicFormer), which employs a dual-stream architecture to extract features from each modality simultaneously. Using the Cross Transformer, it queries features from one modality and retrieves corresponding responses from the other, enabling effective communication between the bimodal features. Additionally, we incorporate a deformable Transformer architecture to expand the search space. In experiments on the CT-MRI multimodal image segmentation task of the MM-WHS dataset, MicFormer improves the whole-heart segmentation Dice score to 85.57 and MIoU to 75.51, outperforming other multimodal segmentation techniques by margins of 2.83 and 4.23, respectively. This demonstrates the efficacy of MicFormer in integrating relevant information across modalities in multimodal tasks. These findings have significant implications for multimodal image tasks, and we believe MicFormer has extensive potential for broader applications across various domains. Our code is available at https://github.com/fxxJuses/MICFormer
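For readers unfamiliar with the two reported metrics, the following sketch computes per-class Dice and IoU from flattened label maps and averages them over foreground classes. It uses the standard definitions, Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|; the helper names, the toy labels, and the choice to score an absent class as 1.0 are assumptions for illustration, and the paper's exact averaging protocol may differ.

```python
def dice_and_iou(pred, target, label):
    """Dice and IoU for one class label over flattened label maps."""
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    pred_count = sum(pred_mask)
    target_count = sum(target_mask)
    union = pred_count + target_count - intersection
    if union == 0:
        # Assumed convention: a class absent from both maps scores 1.0.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_count + target_count)
    iou = intersection / union
    return dice, iou


def mean_dice_and_miou(pred, target, labels):
    """Average Dice and IoU over the given class labels."""
    scores = [dice_and_iou(pred, target, label) for label in labels]
    mean_dice = sum(d for d, _ in scores) / len(scores)
    mean_iou = sum(i for _, i in scores) / len(scores)
    return mean_dice, mean_iou


# Toy example with background (0) and two foreground classes (1, 2).
prediction = [0, 1, 1, 2, 2, 2]
ground_truth = [0, 1, 2, 2, 2, 2]
mean_dice, miou = mean_dice_and_miou(prediction, ground_truth, labels=[1, 2])
print(f"Dice = {100 * mean_dice:.2f}, MIoU = {100 * miou:.2f}")
```

For a single class the two metrics are tied by IoU = Dice / (2 - Dice), so IoU is never higher than Dice. The identity does not carry over to averages across classes, which is why a mean Dice and an MIoU are usually reported separately.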

Paper Structure (11 sections, 1 equation, 3 figures, 1 table)

Figures (3)

  • Figure 1: Limitations of current methods: (a) Prioritizing image fusion with a unimodal network for multimodal image segmentation can result in inaccurate feature representation in the target feature region. (b) Modal feature fusion is restricted to a two-stream cross-attention fusion network, in which Query and Key matching enhances the unimodal feature representation without incorporating additional information.
  • Figure 2: Our MicFormer architecture, which consists of (a) a U-shaped parallel feature network, (b) a Cross Transformer Block, and (c) a Deformable Operator. We use depthwise separable convolution to split $Feature_a$ and $Feature_b$ along the channel dimension and compute the positional differences at corresponding positions.
  • Figure 3: Qualitative results: The first row shows MRI image slices and the second row shows CT image slices. The third row displays the CT ground-truth labels. The next six rows show the qualitative segmentation results of VT-UNet, Swin-Unet, Swin UNETR, nnFormer, MedNeXt, and MicFormer.
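
The Deformable Operator in Figure 2(c) relies on sampling features at positions shifted by offsets. A simplified 1D sketch of that sampling step is given below: each reference position is moved by an offset and the feature is read there with linear interpolation. This is an illustration only. In MicFormer the offsets are predicted by the network from the two feature streams and the features are volumetric, whereas here the offsets are supplied by hand, and the helper names `sample_linear` and `deformable_sample` are hypothetical.

```python
import math


def sample_linear(features, position):
    """Read a 1D sequence of feature vectors at a fractional position.

    The position is clamped to the valid range, then the two nearest
    feature vectors are blended by linear interpolation.
    """
    position = min(max(position, 0.0), len(features) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(features) - 1)
    t = position - lower
    return [(1 - t) * a + t * b
            for a, b in zip(features[lower], features[upper])]


def deformable_sample(features, offsets):
    """Sample each index i at the shifted position i + offsets[i]."""
    return [sample_linear(features, i + offset)
            for i, offset in enumerate(offsets)]


# Toy example: four positions with one channel each.
features = [[0.0], [10.0], [20.0], [30.0]]
offsets = [0.5, 0.0, -1.0, 2.0]  # hand-picked here; learned in practice
print(deformable_sample(features, offsets))
# -> [[5.0], [10.0], [10.0], [30.0]]
```

Because the sampled positions are fractional, the interpolation keeps the operation differentiable with respect to the offsets, which is what allows offsets to be learned. Sampled features of this kind can then serve as keys and values in cross attention, letting a query look beyond its fixed corresponding location in the other modality.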