MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder

Ayhan Can Erdur, Christian Beischl, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C Peeken

TL;DR

Medical MRI data frequently suffer from missing sequences, hindering multi-modal analysis. The authors propose a MultiMAE-based pretraining framework for 3D brain MRI with modality-specific encoders and per-modality decoders, which enables cross-modal reconstruction and robustness to missing inputs. It uses a late-fusion transformer backbone and a high masking ratio of $75\%$ across modalities. On segmentation and glioma subtype classification, the approach improves substantially over MAE-ViT baselines when inputs are missing, with an absolute gain of $10.1$ in overall Dice score and $0.46$ in MCC, and it can synthesize entirely missing modalities at inference. The method generalizes to external datasets and offers a flexible encoder suitable for diverse downstream tasks. Reconstruction blur and domain gaps remain open issues, with potential gains from perceptual losses and more advanced decoders.

Abstract

Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing absolute improvements of $10.1$ in overall Dice score and $0.46$ in MCC over the baseline when input sequences are missing. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.
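
To make the masking idea concrete, the following is a minimal sketch of how a token mask over several MRI sequences might be drawn, both for pretraining (a fraction of all tokens hidden) and for inference with one sequence entirely absent. It is not the authors' implementation. The function name `sample_multimodal_mask`, the choice of four modalities with 512 tokens each, and the uniform sampling over the pooled tokens are all assumptions made for illustration; the paper's actual sampling scheme may differ.

```python
import random


def sample_multimodal_mask(num_modalities, tokens_per_modality,
                           mask_ratio=0.75, missing=(), seed=None):
    """Return mask[m][t]; True means token t of modality m is hidden.

    Hypothetical helper: tokens are pooled over all available modalities
    and a uniform random subset is kept visible (an assumption, not
    necessarily the sampling used in the paper).
    """
    rng = random.Random(seed)
    missing = set(missing)
    available = [m for m in range(num_modalities) if m not in missing]
    if not available:
        raise ValueError("at least one modality must be available")

    # Start fully masked; modalities listed in `missing` stay that way.
    mask = [[True] * tokens_per_modality for _ in range(num_modalities)]

    # Pool the tokens of the available modalities and reveal a random subset.
    pool = [(m, t) for m in available for t in range(tokens_per_modality)]
    num_visible = round((1.0 - mask_ratio) * len(pool))
    for m, t in rng.sample(pool, num_visible):
        mask[m][t] = False
    return mask


# Pretraining-style mask: 75% of all tokens hidden, pooled over modalities.
train_mask = sample_multimodal_mask(4, 512, seed=0)
masked = sum(sum(row) for row in train_mask)
print(masked / (4 * 512))  # 0.75

# Inference with modality 2 entirely missing: all other tokens are visible.
infer_mask = sample_multimodal_mask(4, 512, mask_ratio=0.0, missing=[2])
```

Because the visible tokens are drawn from the pooled set, an individual modality can end up with far fewer visible tokens than another in a given sample. Under this sketch, the fully missing case at inference is the extreme of the same mechanism: one row of the mask is entirely `True`, and its content would have to be reconstructed from the remaining sequences.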

Paper Structure

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Adapting the MultiMAE workflow for brain MRIs
  • Figure 2: Reconstruction of a missing modality. The depicted modality was fully masked (for MAE: replaced with background), and then reconstructed using the remaining inputs.