Review of multimodal machine learning approaches in healthcare
Felix Krones, Umar Marikkar, Guy Parsons, Adam Szmul, Adam Mahdi
TL;DR
This paper addresses the gap between single-modality machine learning and the multimodal realities of clinical decision-making. It surveys data modalities (imaging, text, time-series, tabular), fusion architectures (early/intermediate/late/mixed), and model development stages (pre-training and fine-tuning), grounded in a comprehensive review of datasets and studies. Key contributions include a structured taxonomy of fusion strategies, a catalog of multimodal healthcare datasets, and a synthesis of application domains (brain disorders, cancer, chest diseases, dermatology) with transfer-learning trends. The work highlights practical implications for model robustness, interpretability, and deployment, and emphasizes future directions such as foundation models and better data integration to advance clinically impactful multimodal AI.
Abstract
Machine learning methods in healthcare have traditionally focused on data from a single modality, which limits their ability to replicate the clinical practice of integrating multiple sources of information to improve decision making. Clinicians typically rely on a variety of data sources, including patients' demographic information, laboratory data, vital signs and various imaging modalities, to make informed decisions and contextualise their findings. Recent advances in machine learning have made it easier to incorporate multimodal data efficiently, resulting in applications that better represent the clinician's approach. Here, we provide a review of multimodal machine learning approaches in healthcare, offering a comprehensive overview of recent literature. We discuss the various data modalities used in clinical diagnosis, with a particular emphasis on imaging data. We evaluate fusion techniques, explore existing multimodal datasets and examine common training strategies.
