Review of multimodal machine learning approaches in healthcare

Felix Krones; Umar Marikkar; Guy Parsons; Adam Szmul; Adam Mahdi

TL;DR

This paper addresses the gap between single-modality machine learning and the multimodal realities of clinical decision-making. It surveys data modalities (imaging, text, time-series, tabular), fusion architectures (early/intermediate/late/mixed), and model development stages (pre-training and fine-tuning), grounded in a comprehensive review of datasets and studies. Key contributions include a structured taxonomy of fusion strategies, a catalog of multimodal healthcare datasets, and a synthesis of application domains (brain disorders, cancer, chest diseases, dermatology) with transfer-learning trends. The work highlights practical implications for model robustness, interpretability, and deployment, and emphasizes future directions such as foundation models and better data integration to advance clinically impactful multimodal AI.

Abstract

Machine learning methods in healthcare have traditionally focused on using data from a single modality, limiting their ability to effectively replicate the clinical practice of integrating multiple sources of information for improved decision making. Clinicians typically rely on a variety of data sources including patients' demographic information, laboratory data, vital signs and various imaging data modalities to make informed decisions and contextualise their findings. Recent advances in machine learning have facilitated the more efficient incorporation of multimodal data, resulting in applications that better represent the clinician's approach. Here, we provide a review of multimodal machine learning approaches in healthcare, offering a comprehensive overview of recent literature. We discuss the various data modalities used in clinical diagnosis, with a particular emphasis on imaging data. We evaluate fusion techniques, explore existing multimodal datasets and examine common training strategies.

Paper Structure (22 sections, 5 figures, 5 tables)

Introduction
Data modalities
Imaging data
Text data
Time-series data
Tabular data
Model development
Data pre-processing
Stage 1: Model pre-training
Stage 2: Model fine-tuning
Model evaluation
Fusion approaches
Modality-level fusion
Feature-level fusion
Multimodal applications
...and 7 more sections

Figures (5)

Figure 1: Clinical data modalities and prediction tasks. Distinct data modalities play pivotal roles in clinical decision-making: imaging data, text data, time-series data and tabular data. All are used for various clinical prediction tasks. Medical prediction tasks in clinical practice involve leveraging machine learning models and algorithms to forecast future clinical outcomes based on existing patient data. They play a crucial role in the decision-making process for diagnosis, prognosis and treatment.
Figure 2: Model development. After pre-training the model, the model weights are fine-tuned on the target domain (e.g. medical images) and the model architecture is adjusted to the target task (e.g. classification).
Figure 3: Data fusion architectures: (a) Early fusion combines raw features or extracted features before passing them into the final model. The feature extraction is optional; (b) Intermediate fusion concatenates features extracted from the original data using an integrated modelling approach where the loss is back-propagated through the whole model; (c) In late fusion the predictions or features are generated by multiple models and aggregated after their individual processing.
Figure 4: Examples of mixed fusion architectures: (a) The loss is only back-propagated for some modalities (blue) while others (yellow) are fused at a later step; (b) Similar to (a), but predictions from only one modality are used; (c) Features from one modality (blue) are combined with predictions from another modality (yellow).
Figure 5: Feature-level fusion: (a) Concatenation merges feature vectors end-to-end. (b) Operation-based methods combine vectors via element-wise mathematical operations or attention mechanisms, which requires same-shaped vectors. (c) Learning-based fusion uses machine learning to reconstruct the original features in a shared informative space.
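
The fusion ideas in the captions of Figures 3 and 5 can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the choice of an element-wise sum as the operation, the choice of averaging as the late-fusion aggregation, and all numeric values are assumptions made purely for illustration.

```python
from typing import List, Sequence


def concat_fusion(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Figure 5(a): merge two feature vectors end-to-end."""
    return list(a) + list(b)


def elementwise_fusion(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Figure 5(b): element-wise sum (one possible operation).

    Operation-based fusion needs same-shaped vectors, so a length
    mismatch is rejected explicitly.
    """
    if len(a) != len(b):
        raise ValueError("operation-based fusion needs same-shaped vectors")
    return [x + y for x, y in zip(a, b)]


def late_fusion(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Figure 3(c): aggregate predictions of independently run models.

    Averaging is an assumed aggregation rule; other rules are possible.
    """
    n_models = len(predictions)
    return [sum(per_class) / n_models for per_class in zip(*predictions)]


# Hypothetical feature vectors from two modalities (made-up values).
image_features = [0.5, 0.25, 1.0]
tabular_features = [1.0, 0.75, 0.5]

fused_concat = concat_fusion(image_features, tabular_features)
fused_sum = elementwise_fusion(image_features, tabular_features)

# Hypothetical class probabilities from two single-modality models.
fused_prediction = late_fusion([[0.75, 0.25], [0.5, 0.5]])

print(fused_concat)      # six values: image features followed by tabular features
print(fused_sum)         # three values: one per feature position
print(fused_prediction)  # two values: one averaged probability per class
```

Concatenation places no constraint on vector lengths but grows the fused representation, whereas the element-wise variant keeps the dimensionality fixed at the cost of requiring matching shapes. Learning-based fusion (Figure 5(c)) and intermediate fusion (Figure 3(b)) depend on a trained model and are therefore not covered by this sketch.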
