HyperMM : Robust Multimodal Learning with Varying-sized Inputs

Hava Chaptoukaev; Vincenzo Marcianó; Francesco Galati; Maria A. Zuluaga

TL;DR

HyperMM addresses multimodal learning when some modalities are missing, a common situation in clinical data. It is a two-step, end-to-end framework. First, a universal feature extractor $\varphi$ is trained, conditioned on modality identifiers through a conditional hypernetwork. Then $\varphi$ is frozen, and a permutation-invariant network sums the features of the observed modalities and passes the result to a classifier $\rho$, giving $f(X_{obs}) = \rho\left(\sum_{s_k \in S} \varphi(s_k)\right)$, where $S$ is the set of observed modalities in $X_{obs}$. The approach requires no imputation, treats inputs of varying size as sets, and remains robust at high rates of missing data while staying computationally efficient. Experiments on Alzheimer's disease detection and breast cancer classification show that HyperMM outperforms imputation-based and conventional multimodal methods, and that it also handles varying-sized datasets beyond the missing-modality scenario.
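The sum-based formulation $f(X_{obs}) = \rho\left(\sum_{s_k \in S} \varphi(s_k)\right)$ can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the per-modality scale and shift below are only a toy stand-in for the conditional hypernetwork, and the names `make_toy_model`, `predict`, `FEATURE_DIM`, the modality labels, and the random weights are all assumptions made for illustration.

```python
import math
import random

FEATURE_DIM = 4  # hypothetical latent size, chosen for illustration


def make_toy_model(modalities, input_dim, seed=0):
    """Build toy stand-ins for phi and rho with fixed random weights."""
    rng = random.Random(seed)
    # Projection shared by every modality (the "universal" part of phi).
    shared = [[rng.uniform(-1, 1) for _ in range(input_dim)]
              for _ in range(FEATURE_DIM)]
    # Assumption: a scale/shift pair per modality identifier stands in
    # for the weights a conditional hypernetwork would generate.
    cond = {
        m: ([rng.uniform(0.5, 1.5) for _ in range(FEATURE_DIM)],
            [rng.uniform(-0.5, 0.5) for _ in range(FEATURE_DIM)])
        for m in modalities
    }
    rho_w = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]

    def phi(x, modality):
        # Extract features from one input, conditioned on its modality.
        scale, shift = cond[modality]
        hidden = [sum(w * v for w, v in zip(row, x)) for row in shared]
        return [math.tanh(s * h + b) for s, h, b in zip(scale, hidden, shift)]

    def rho(z):
        # Toy classifier: linear layer followed by a sigmoid.
        logit = sum(w * v for w, v in zip(rho_w, z))
        return 1.0 / (1.0 + math.exp(-logit))

    return phi, rho


def predict(observed, phi, rho):
    """Compute rho(sum_k phi(s_k)) over a set of (input, modality) pairs."""
    pooled = [0.0] * FEATURE_DIM
    for x, modality in observed:
        pooled = [p + f for p, f in zip(pooled, phi(x, modality))]
    return rho(pooled)


phi, rho = make_toy_model(["mri", "pet"], input_dim=3)
mri = ([0.2, -0.1, 0.7], "mri")
pet = ([0.5, 0.3, -0.4], "pet")

full = predict([mri, pet], phi, rho)      # both modalities observed
swapped = predict([pet, mri], phi, rho)   # same set, different order
mri_only = predict([mri], phi, rho)       # PET missing, no imputation
```

Because the features are summed before classification, `full` and `swapped` agree, and `predict` accepts a single modality without any placeholder for the missing one. The sketch only illustrates the inference-time aggregation; it does not reproduce the two training steps.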

Abstract

Combining multiple modalities carrying complementary information through multimodal learning (MML) has shown considerable benefits for diagnosing multiple pathologies. However, the robustness of multimodal models to missing modalities is often overlooked. Most works assume modality completeness in the input data, while in clinical practice, it is common to have incomplete modalities. Existing solutions that address this issue rely on modality imputation strategies before using supervised learning models. These strategies, however, are complex, computationally costly and can strongly impact subsequent prediction models. Hence, they should be used with parsimony in sensitive applications such as healthcare. We propose HyperMM, an end-to-end framework designed for learning with varying-sized inputs. Specifically, we focus on the task of supervised MML with missing imaging modalities without using imputation before training. We introduce a novel strategy for training a universal feature extractor using a conditional hypernetwork, and propose a permutation-invariant neural network that can handle inputs of varying dimensions to process the extracted features, in a two-phase task-agnostic framework. We experimentally demonstrate the advantages of our method in two tasks: Alzheimer's disease detection and breast cancer classification. We demonstrate that our strategy is robust to high rates of missing data and that its flexibility allows it to handle varying-sized datasets beyond the scenario of missing modalities.

Paper Structure (11 sections, 2 equations, 4 figures, 2 tables)

This paper contains 11 sections, 2 equations, 4 figures, 2 tables.

Introduction
Related work
Contributions
Methodology
Overview of the method
Universal Feature Extractor
Permutation Invariant Architecture
Experiments
Alzheimer's Disease Detection
Breast Cancer Classification
Conclusion

Figures (4)

Figure 1: Overview of our HyperMM framework. A network $\varphi$ is trained to extract features from any modality in $\mathcal{D}$ by jointly optimizing feature reconstruction and unimodal prediction (step 1). The learned $\varphi$ is then frozen and used to process multimodal inputs; the resulting latent features are aggregated and passed through a network $\rho$ for prediction (step 2).
Figure 2: Feature extraction strategy used in the ADNI baselines (see liang2021alzheimer). All 2D slices of one 3D volume are fed to a VGG11. A 1D max pooling on the slice dimension is applied to the resulting feature blocks to obtain a single block per 3D image. The latter is passed through a $1 \times 1$ convolution layer to obtain AD-specific features that can then be fed to a classifier.
Figure 3: Examples of real and imputed slices of MRI and PET images for one patient. While the PET reconstructions (bottom right) translated from the corresponding MRI (top left) are reasonably similar to the original PET image (bottom left), the MRI reconstructions (top right) translated from the low-resolution PET (bottom left) are much less consistent with reality (top left).
Figure 4: Comparison of decision strategies for patient-level tumor classification. Our method (left) combines all of a subject's available images during training, regardless of magnification level, to obtain a patient-level decision. In contrast, traditional approaches (right) make predictions at the image level and combine the final predictions to obtain a patient-level decision.
