A Systematic Review of Intermediate Fusion in Multimodal Deep Learning for Biomedical Applications

Valerio Guarrasi, Fatih Aksu, Camillo Maria Caruso, Francesco Di Feola, Aurora Rofena, Filippo Ruffini, Paolo Soda

TL;DR

This systematic review targets intermediate fusion in biomedical multimodal deep learning, formalizing how modality-specific features are fused during learning and introducing a structured notation to standardize analysis. It synthesizes 54 studies to map modalities, unimodal and fusion modules, and learning strategies, revealing a predominance of single, sudden fusion and a reliance on raw imaging and tabular data. Key contributions include a formal fusion notation, a taxonomy of fusion operations (Concatenation, Tensor-operation, Attention, Calibration, Knowledge-sharing), and insights into data sources, dataset characteristics, and experimental rigor. The work highlights gaps in explainability, missing-modality robustness, benchmark datasets, and rigorous experimental protocols, offering concrete directions toward more generalizable, interpretable, and clinically deployable MDL systems in biomedicine.

Abstract

Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. Within MDL, intermediate fusion differs from early and late fusion in that it combines modality-specific features during the learning process. This systematic review aims to comprehensively analyze and formalize current intermediate fusion methods in biomedical applications. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a structured notation to enhance the understanding and application of these methods beyond the biomedical domain. Our findings are intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.

Paper Structure (50 sections, 13 equations, 25 figures, 5 tables)


Figures (25)

  • Figure 1: Venn Diagram of Search Hits by Category. This diagram displays the percentage of search hits for each specified category, ranging from general fields to more specialized areas, i.e., "Multimodal Learning", "Multimodal Learning in Biomedical Applications", "Multimodal Deep Learning", "Multimodal Deep Learning in Biomedical Applications", "Multimodal Deep Learning via Intermediate Fusion", "Multimodal Deep Learning in Biomedical Applications via Intermediate Fusion".
  • Figure 2: PRISMA Flow Chart of Literature Selection Process. This flow chart outlines the systematic process of screening and selecting studies for inclusion in the review. It details the number of records identified, included, and excluded at each stage of the search and selection process, from initial database search through to the final included studies.
  • Figure 3: Schematic Representation of Intermediate Fusion in MDL. This diagram illustrates the intermediate fusion approach within the framework of MDL. It begins with various data modalities $x_1, x_2, \ldots, x_n$ each represented in distinct colors (green, purple, orange), progressing through specialized unimodal modules $f_1, f_2, \ldots, f_n$ where individual features $h_1, h_2, \ldots, h_n$ are extracted, maintaining the color coding of their respective modalities. These features are then integrated by the fusion module $\mathscr{F}$, resulting in a multimodal feature representation $h$ that is a blend of all the input colors, signifying the fusion of modal features. The subsequent multimodal module $f$ processes these integrated features, aiming to achieve the specified target $y$ shown in yellow. Red arrows indicate the forward pass through the network, i.e., the inference phase, while blue arrows represent the backpropagation, facilitating the training of both unimodal and multimodal components.
  • Figure 4: Stacked bar plot showing the type of modalities used in the analyzed articles. The bar for each macro-modality is segmented, providing information on the specific modalities included.
  • Figure 5: Distribution of balanced, imbalanced, and highly imbalanced datasets.
  • ...and 20 more figures
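
The pipeline described in the Figure 3 caption (inputs $x_i$, unimodal modules $f_i$, features $h_i$, fusion module $\mathscr{F}$, fused representation $h$, multimodal module $f$, target $y$) can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the random weights, the input values, and the helper names (`linear_relu`, `make_layer`, `fuse_concat`, `forward`) are all hypothetical, and concatenation is used only because it is the simplest operation in the review's taxonomy. Training (the backpropagation arrows in the figure) is omitted.

```python
import random


def linear_relu(weights, bias, x):
    """One dense layer followed by ReLU: max(0, W x + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for a dense layer (illustrative only)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_concat(features):
    """Fusion module F: concatenate unimodal features h_1..h_n into h."""
    return [v for h_i in features for v in h_i]


def forward(inputs, unimodal_layers, multimodal_layer):
    # Unimodal modules: h_i = f_i(x_i), one module per modality.
    features = [linear_relu(w, b, x)
                for (w, b), x in zip(unimodal_layers, inputs)]
    # Fusion module: h = F(h_1, ..., h_n).
    h = fuse_concat(features)
    # Multimodal module: y = f(h), here a single linear layer without ReLU.
    w, b = multimodal_layer
    y = [sum(wi * v for wi, v in zip(row, h)) + bi for row, bi in zip(w, b)]
    return features, h, y


rng = random.Random(0)
x1 = [0.2, 0.5, 0.1, 0.9]  # hypothetical imaging-derived input (4 values)
x2 = [1.0, 0.0, 3.5]       # hypothetical tabular input (3 values)

unimodal = [make_layer(4, 3, rng), make_layer(3, 2, rng)]  # f_1, f_2
multimodal = make_layer(5, 1, rng)                         # f, input size 3 + 2

features, h, y = forward([x1, x2], unimodal, multimodal)
print(len(h), len(y))  # prints: 5 1
```

Because the fusion step sits between trainable modules, gradients from the output would flow back through `fuse_concat` into each unimodal module during training. That joint training is what distinguishes intermediate fusion from late fusion, where unimodal predictions are combined only at the output.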