XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification

Xiaoyu Zheng, Xu Chen, Shaogang Gong, Xavier Griffin, Greg Slabaugh

TL;DR

XFMamba addresses multi-view medical image classification by learning cross-view correlations with a pure state-space Mamba architecture. It introduces a four-stage dual-view encoder and a two-stage fusion module consisting of a shallow Cross-View Swapping Mamba (CVSM) and a deep Multi-View Combination Mamba (MVCM) to fuse features efficiently. Across MURA, CheXpert, and CBIS-DDSM, XFMamba achieves state-of-the-art AUROC with competitive efficiency, and ablation confirms the contributions of CVSM and MVCM. The work suggests a practical baseline for cross-view and potential multi-modality fusion in medical imaging.

Abstract

Compared to single-view medical image classification, using multiple views can significantly enhance predictive accuracy, as it accounts for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from a limited receptive field (CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy, facilitating the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enabling seamless information transfer between views. Results on three public datasets, MURA, CheXpert, and CBIS-DDSM, illustrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods. Code is available at https://github.com/XZheng0427/XFMamba.
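
The efficiency argument in the abstract rests on how cost grows with the number of image tokens. As a rough illustration only, the sketch below counts pairwise token interactions for full self-attention (which grow quadratically) against the steps of a single sequential state-space scan (which grow linearly). The patch size of 16, the square resolutions, and the helper names are assumptions chosen for the example; they are not taken from the paper, and the counts ignore constant factors such as hidden dimension or the number of scan directions.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for one view (assumed patching)."""
    return (height // patch) * (width // patch)


def attention_pairs(n: int) -> int:
    """Pairwise token interactions in full self-attention: grows as n^2."""
    return n * n


def scan_steps(n: int) -> int:
    """Steps in one sequential state-space scan over the tokens: grows as n."""
    return n


if __name__ == "__main__":
    # Hypothetical resolutions; doubling the side length quadruples the tokens.
    for side in (224, 448, 896):
        n = num_tokens(side, side, patch=16)
        print(f"side={side} tokens={n} "
              f"attention_pairs={attention_pairs(n)} scan_steps={scan_steps(n)}")
```

Under these assumptions, doubling the image side multiplies the token count by 4, the attention interaction count by 16, and the scan step count by 4, which is the gap the abstract refers to when contrasting transformers with state space models.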

Paper Structure

This paper contains 15 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: XFMamba architecture. (i) The overall architecture is composed of a four-stage encoder and a two-stage fusion module. (ii) Visual State Space Module (VSSM) for feature extraction. (iii) Cross-View Swapping Mamba (CVSM) block for shallow fusion. (iv) Multi-View Combination Mamba (MVCM) block for deep fusion.
  • Figure 2: (Left) Computational complexity comparison on CBIS-DDSM dataset. The size of each circle denotes the model size, i.e., parameters. (Right) Qualitative results for successful cases and failed cases (right of the orange dotted line) using different methods on the CBIS-DDSM dataset.