MultiFusionNet: Multilayer Multimodal Fusion of Deep Neural Networks for Chest X-Ray Image Classification

Saurabh Agarwal, K. V. Arya, Yogesh Kumar Meena

TL;DR

This work tackles chest X-ray disease classification under limited data by introducing a multilayer multimodal fusion framework that fuses multi-layer features from ResNet50V2 and InceptionV3 using a dedicated Fusion of Different-Sized Feature Maps (FDSFM) module and a model-level addition fusion. The approach preserves discriminative information across depths, reduces parameter count, and demonstrates strong performance on the Cov-Pneum dataset, achieving 97.21% accuracy for three-class and 99.60% for two-class tasks, outperforming state-of-the-art methods. Grad-CAM provides visual explanations, enhancing clinical trust and interpretability. The methodology is extensible to other chest-imaging problems and potential multimodal integrations, offering a practical path toward robust computer-aided diagnosis systems.

Abstract

Chest X-ray imaging is a critical diagnostic tool for identifying pulmonary diseases. However, manual interpretation of these images is time-consuming and error-prone. Automated systems utilizing convolutional neural networks (CNNs) have shown promise in improving the accuracy and efficiency of chest X-ray image classification. While previous work has mainly focused on using feature maps from the final convolution layer, there is a need to explore the benefits of leveraging additional layers for improved disease classification. Extracting robust features from limited medical image datasets remains a critical challenge. In this paper, we propose a novel deep learning-based multilayer multimodal fusion model that emphasizes extracting features from different layers and fusing them. Our disease detection model considers the discriminatory information captured by each layer. Furthermore, we propose the fusion of different-sized feature maps (FDSFM) module to effectively merge feature maps from diverse layers. The proposed model achieves accuracies of 97.21% and 99.60% for three-class and two-class classification, respectively. The proposed multilayer multimodal fusion model, along with the FDSFM module, holds promise for accurate disease classification and can also be extended to other disease classifications in chest X-ray images.
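As a rough illustration of the idea behind fusing different-sized feature maps, the NumPy sketch below pools each map to a common spatial size, projects its channels with a 1×1 convolution, and sums the results. It is not the authors' implementation: the function names (`avg_pool_to`, `conv1x1`, `fuse_feature_maps`), the choice of average pooling, the element-wise addition inside the module, the random weights, and the example shapes are all assumptions made for illustration.

```python
import numpy as np


def avg_pool_to(fm, out_hw):
    """Average-pool an (H, W, C) feature map down to (out_hw, out_hw, C).

    Assumes H and W are integer multiples of out_hw (non-overlapping windows).
    """
    h, w, c = fm.shape
    kh, kw = h // out_hw, w // out_hw
    assert kh * out_hw == h and kw * out_hw == w, "size must divide evenly"
    # Split each spatial axis into (output index, offset within window),
    # then average over the two within-window axes.
    return fm.reshape(out_hw, kh, out_hw, kw, c).mean(axis=(1, 3))


def conv1x1(fm, weight):
    """1x1 convolution: the same linear map over channels at every pixel.

    fm has shape (H, W, C_in); weight has shape (C_in, C_out).
    """
    return fm @ weight


def fuse_feature_maps(feature_maps, weights, out_hw):
    """Align every map to (out_hw, out_hw, C_out), then add element-wise.

    Addition is an assumption here; concatenation would be another option.
    """
    aligned = [
        conv1x1(avg_pool_to(fm, out_hw), w)
        for fm, w in zip(feature_maps, weights)
    ]
    return np.sum(aligned, axis=0)


# Hypothetical shapes for a shallow, a middle, and a deep layer.
rng = np.random.default_rng(0)
maps = [
    rng.standard_normal((28, 28, 64)),
    rng.standard_normal((14, 14, 128)),
    rng.standard_normal((7, 7, 256)),
]
# Random projection weights stand in for learned 1x1 convolution kernels.
projections = [rng.standard_normal((fm.shape[-1], 256)) for fm in maps]

fused = fuse_feature_maps(maps, projections, out_hw=7)
print(fused.shape)  # (7, 7, 256)
```

The 1×1 projection is what allows maps with different channel counts to be added; without it, only the spatial sizes would match after pooling.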

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The architecture of the multilayer multimodal fusion model, which incorporates the InceptionV3 and ResNet50V2 models. In the InceptionV3 model, blocks 1 to 11 consist of layers with different-sized filters (1$\times$1, 3$\times$3, and 5$\times$5) applied in parallel. Similarly, the ResNet50V2 model includes layers from blocks 1 to 5 with different-sized filters (1$\times$1 and 3$\times$3). The FM box represents the feature maps extracted from different layers of the respective networks.
  • Figure 2: The layered diagram illustrates the novel module called Fusion of Different-Sized Feature Maps (FDSFM). The size transformation is achieved through pooling sampling and the Conv 1$\times$1 layer, enabling effective fusion of feature maps.
  • Figure 3: Comparison of the learning behavior of the baseline models (transfer learning (TL) with ResNet50V2 and TL with InceptionV3) and the proposed models in terms of (a) training accuracy and (b) training loss per epoch.
  • Figure 4: Performance evaluation of state-of-the-art models and our models on the proposed Cov-Pneum dataset, reporting classification accuracy (%) for both the 3-class (3-C) and binary (2-C) models, together with per-class performance (P: precision, R: recall, F-1: F1 score). Here, M1 is the multilayer fusion of the ResNet50V2 model, M2 is the multilayer fusion of the InceptionV3 model, M3 is the single-layer multimodal fusion model, and M4 is the multilayer multimodal fusion model.
  • Figure 5: Performance of the proposed models on the Cov-Pneum dataset. Confusion matrices (a-d) show the detection rate of COVID-19 (C), Pneumonia (P), and Normal (N) in the 3-class setting: (a) multilayer fusion of the ResNet50V2 model, (b) multilayer fusion of the InceptionV3 model, (c) single-layer multimodal fusion model, and (d) multilayer multimodal fusion model. Confusion matrices (e-h) show the detection rate of COVID-19 and Normal in the 2-class setting: (e) multilayer fusion of the ResNet50V2 model, (f) multilayer fusion of the InceptionV3 model, (g) single-layer multimodal fusion model, and (h) multilayer multimodal fusion model.
  • ...and 2 more figures
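
The per-class precision, recall, and F1 scores reported alongside accuracy in Figures 4 and 5 can all be derived from a confusion matrix. The sketch below shows the standard definitions; the function name `per_class_metrics`, the row/column convention (rows are true classes, columns are predicted classes), and the counts in the example are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def per_class_metrics(cm):
    """Compute precision, recall, F1 per class and overall accuracy.

    Assumed convention: cm[i, j] is the number of samples whose true class
    is i and whose predicted class is j. Every class is assumed to appear
    at least once as a true label and as a prediction, so no division by
    zero occurs.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)  # column sums: predicted counts
    recall = true_pos / cm.sum(axis=1)     # row sums: true counts
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = true_pos.sum() / cm.sum()
    return precision, recall, f1, accuracy


# Hypothetical 3-class counts (COVID-19, Pneumonia, Normal), not paper results.
example_cm = [
    [95, 3, 2],
    [4, 90, 6],
    [1, 5, 94],
]
precision, recall, f1, accuracy = per_class_metrics(example_cm)
for name, p, r, f in zip(["C", "P", "N"], precision, recall, f1):
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

If a confusion matrix uses the opposite orientation (rows as predictions), precision and recall swap, so it is worth checking the convention before comparing numbers.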