Automated Ensemble Multimodal Machine Learning for Healthcare

Fergus Imrie; Stefan Denner; Lucas S. Brunschwig; Klaus Maier-Hein; Mihaela van der Schaar

TL;DR

This work addresses a gap in clinical ML: most models issue predictions from a single modality. It introduces AutoPrognosis-M, an AutoML-based framework that integrates tabular clinical data with medical imaging through three fusion strategies (late, early, and joint) and combines them using ensemble learning. The framework also supports explainability and provides uncertainty estimates using conformal prediction. In an illustrative application on the PAD-UFES-20 skin lesion dataset, multimodal integration consistently outperforms unimodal baselines. Acquiring images for only the roughly 20% of samples with the highest uncertainty under the tabular model captures approximately 55% (lesion categorization) and 65% (cancer diagnosis) of the multimodal improvement. The framework is open-sourced to accelerate clinical adoption and further innovation in multimodal healthcare AI.

Abstract

The application of machine learning in medicine and healthcare has led to the creation of numerous diagnostic and prognostic models. However, despite their success, current approaches generally issue predictions using data from a single modality. This stands in stark contrast with clinician decision-making, which draws on diverse information from multiple sources. While several multimodal machine learning approaches exist, significant challenges in developing multimodal systems remain and continue to hinder clinical adoption. In this paper, we introduce a multimodal framework, AutoPrognosis-M, that enables the integration of structured clinical (tabular) data and medical imaging using automated machine learning. AutoPrognosis-M incorporates 17 imaging models, including convolutional neural networks and vision transformers, and three distinct multimodal fusion strategies. In an illustrative application using a multimodal skin lesion dataset, we highlight the importance of multimodal machine learning and the power of combining multiple fusion strategies using ensemble learning. We have open-sourced our framework as a tool for the community and hope it will accelerate the uptake of multimodal machine learning in healthcare and spur further innovation.

Paper Structure (20 sections, 5 figures, 7 tables)

Introduction
Methods: AutoPrognosis-M
Automated Machine Learning
Unimodal approaches
Tabular
Imaging
Multimodal data integration
Late Fusion
Early Fusion
Joint Fusion
Fusion Ensembles
Explainability
Uncertainty estimation
Experiments
Data
...and 5 more sections

Figures (5)

Figure 1: Overview of the types of questions that can be asked with multimodal machine learning. In addition to developing powerful multimodal models (e), multimodal ML can help understand the value of each modality (a), the impact of adding a new modality (b), when an additional modality is required (c), and how the information in different modalities interacts (d).
Figure 2: Overview of AutoPrognosis-M. AutoPrognosis-M leverages automated machine learning to produce multimodal ensembles by optimizing state-of-the-art image and tabular modeling approaches across three fusion strategies. AutoPrognosis-M also enables such models to be interrogated with explainable AI and provides uncertainty estimates using conformal prediction.
Figure 3: Illustration of the three types of multimodal fusion. (a) Late fusion combines the predictions of separate unimodal models. (b) Early fusion trains a predictive model on the combination of fixed extracted features. (c) Joint fusion flexibly integrates multiple modalities, learning to extract representations and make predictions simultaneously in an end-to-end manner.
Figure 4: Selective acquisition of images based on conformal prediction. By acquiring images for around 20% of samples with the highest predicted uncertainty based on the tabular features, we capture approximately 55% and 65% of the improvement of the multimodal classifier for (a) lesion categorization and (b) cancer diagnosis, respectively. We approach the performance of the multimodal classifier by acquiring images for around half of all patients.
Figure 5: Comparison of explanations for unimodal and multimodal models using integrated gradients. The original image (left, img_id: PAT_521_984_412) together with attributions for the unimodal (center left) and joint fusion EfficientNetB4 models (center right and right).
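
The late fusion of Figure 3(a) and the uncertainty-guided image acquisition of Figure 4 can be illustrated with a minimal NumPy sketch. This is not the AutoPrognosis-M implementation: all function names are hypothetical, the fusion weight is fixed rather than learned, the conformal score (one minus the probability of the true class) is a common choice assumed here, and ranking uncertainty by prediction-set size with ties broken by top-class probability is likewise an assumption.

```python
import math
import numpy as np


def late_fusion(prob_tabular, prob_image, weight_tabular=0.5):
    """Late fusion: weighted average of class probabilities from two unimodal models."""
    return weight_tabular * prob_tabular + (1.0 - weight_tabular) * prob_image


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration with the score 1 - p(true class)."""
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank. Simplification: when the rank exceeds n
    # we clip to the largest calibration score instead of returning infinity.
    rank = min(n, math.ceil((n + 1) * (1.0 - alpha)))
    return scores[rank - 1]


def prediction_set_sizes(probs, threshold):
    """Count the classes kept in each sample's conformal prediction set."""
    return np.sum(1.0 - probs <= threshold, axis=1)


def select_for_imaging(tab_probs, threshold, fraction=0.2):
    """Return indices of the most uncertain samples under the tabular model.

    Assumption: larger prediction sets are more uncertain; ties are broken
    by lower top-class probability.
    """
    sizes = prediction_set_sizes(tab_probs, threshold)
    top_prob = tab_probs.max(axis=1)
    order = np.lexsort((top_prob, -sizes))  # the last key is the primary key
    n_select = math.ceil(fraction * len(tab_probs))
    return order[:n_select]


def selective_predictions(tab_probs, image_probs, selected):
    """Use fused predictions only for samples where an image was acquired."""
    final = tab_probs.copy()
    final[selected] = late_fusion(tab_probs[selected], image_probs[selected])
    return final
```

In use, the threshold would be calibrated once on held-out tabular predictions, `select_for_imaging` would flag the patients for whom an image is requested, and `selective_predictions` would return tabular-only outputs for everyone else. Varying `fraction` between 0 and 1 gives the kind of acquisition curve that Figure 4 describes.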
