Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects

Elisa Warner; Joonsang Lee; William Hsu; Tanveer Syeda-Mahmood; Charles Kahn; Olivier Gevaert; Arvind Rao

TL;DR

This survey examines how multimodal learning can improve image-based biomedicine and clinical decision support, organized around five core challenges: representation, fusion, translation, alignment, and co-learning. It reviews taxonomy-driven approaches and modern methods, including joint versus coordinated representations, model-based fusion, privileged learning, and domain adaptation, with MRI/CT/PET and EHR modalities as focal domains. It highlights transformer-based alignment and image-to-image translation as key directions, and discusses practical deployment considerations such as missing data and standardization. It concludes with a call for principled evaluation, uncertainty quantification, data sharing, and ethical governance to translate multimodal biomedical AI into safe, effective clinical practice.

Abstract

Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models. This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems. Emphasizing challenges and innovations in addressing multimodal representation, fusion, translation, alignment, and co-learning, the paper explores the transformative potential of multimodal models for clinical predictions. It also highlights the need for principled assessments and practical implementation of such models, bringing attention to the dynamics between decision support systems and healthcare providers and personnel. Despite advancements, challenges such as data biases and the scarcity of "big data" in many biomedical domains persist. We conclude with a discussion on principled innovation and collaborative efforts to further the mission of seamless integration of multimodal ML models into biomedical practice.

Paper Structure (10 sections, 2 equations, 4 figures)

Introduction
Multimodal Learning in Medical Applications
Representation
Fusion
Translation
Alignment
Co-learning
Privileged Learning
Domain Adaptation
Discussion

Figures (4)

Figure 1: Challenges in multimodal learning: 1) Representation, which concerns how multiple modalities will be geometrically represented and how to represent intrinsic relationships between them; 2) Fusion, the challenge of combining multiple modalities into a predictive model; 3) Translation, involving the mapping of one modality to another; 4) Alignment, which attempts to align two separate modalities spatially or temporally; and 5) Co-learning, which involves using one modality to assist the learning of another modality.
Figure 2: A graphical representation of the taxonomical sublevels of multimodal representation and fusion, and the focus of each challenge. Multimodal representation can be categorized into whether the representations are joined into a single vector (joint) or separately constructed to be influenced by each other (coordinated). Multimodal fusion can be distinguished by whether a model is uniquely constructed to fuse the modalities (model-based), or whether fusion occurs before or after the model step (model-agnostic).
Figure 3: A graphical representation of the taxonomical sublevels of multimodal translation, alignment and co-learning, and the focus of each challenge. In translation, models are distinguished based on whether they require use of a dictionary to save associations between modalities (dictionary-based), or if the associations are learned in a multimodal network (generative). In alignment, distinction is made depending on the purpose of the alignment, whether as the goal (explicit) or as an intermediate step towards the goal output (implicit). In co-learning, a distinction is made between the use of parallel (paired) multimodal data, or non-parallel (unpaired) multimodal data. In co-learning models, one of the modalities is only used in training but does not appear in testing.
Figure 4: Two types of transfer learning described in this work are privileged learning (top) and domain adaptation (bottom). In privileged learning, a plentiful set of data that is typically low-cost but also has a low signal-to-noise ratio is available in both training and testing, while a limited gold-standard quality set is used for training only. In this example, the plentiful set is used to train the target model, while the limited set constrains the model parameters to increase the model's ability to associate the low-cost modality with the ground truth. In domain adaptation, there is a target dataset consisting of only a few samples and a source dataset consisting of many samples. If the target data is too small to build a reliable model in training, source data can be augmented to make the model more robust. Otherwise, the target model can be trained with few examples, while a second source model is used to help make the target model more generalizable.
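
The Figure 2 caption describes model-agnostic fusion as fusion that occurs before or after the model step. These two variants are often called early and late fusion. The following is a minimal sketch of both, not the survey's own implementation: the function names, the feature dimensions, the example probabilities, and the `image_weight` parameter are all hypothetical and chosen only for illustration.

```python
import numpy as np


def early_fusion(image_feats, clinical_feats):
    """Fuse before the model: concatenate per-patient feature vectors."""
    return np.concatenate([image_feats, clinical_feats], axis=1)


def late_fusion(image_probs, clinical_probs, image_weight=0.5):
    """Fuse after the model: weighted average of per-modality predictions."""
    return image_weight * image_probs + (1.0 - image_weight) * clinical_probs


# Hypothetical inputs: 4 patients, 8 imaging features, 3 clinical variables.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))
clinical_feats = rng.normal(size=(4, 3))

# Early fusion yields one combined feature matrix for a single downstream model.
fused = early_fusion(image_feats, clinical_feats)

# Late fusion combines the outputs of two separately trained models
# (the probabilities here are made-up stand-ins for those outputs).
image_probs = np.array([0.9, 0.2, 0.6, 0.4])
clinical_probs = np.array([0.7, 0.4, 0.2, 0.8])
combined = late_fusion(image_probs, clinical_probs, image_weight=0.5)
```

In this sketch, early fusion requires every patient to have both modalities present at the feature level, whereas late fusion only needs each per-modality model to produce a prediction. Model-based fusion, by contrast, would build the combination into the model architecture itself and is not shown here.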
