Structure is Supervision: Multiview Masked Autoencoders for Radiology

Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt

TL;DR

Radiology data are inherently structured, with multiple projections per exam and associated reports, but many self-supervised methods treat images in isolation. The authors propose MVMAE, a multiview masked autoencoder that jointly performs per-view reconstruction and cross-view alignment to learn view-invariant, detail-preserving representations, and they extend it with MVMAE-V2T to use radiology reports as an auxiliary textual signal during pretraining. Evaluated on MIMIC-CXR, CheXpert, and PadChest, MVMAE achieves state-of-the-art disease classification with strong calibration and label efficiency, while MVMAE-V2T offers additional gains in limited-label scenarios without requiring text at inference. The work demonstrates that leveraging study-level structure and textual supervision can yield scalable, clinically grounded foundation models that transfer across institutions and modalities.

Abstract

Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
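As a rough illustration of how masked reconstruction and cross-view alignment could be combined into one study-level objective, here is a minimal NumPy sketch. It is not the authors' formulation: the function names, the cosine-distance alignment term, and the `align_weight` parameter are all assumptions made for this example, and the encoder and decoder are left out entirely (the sketch starts from their outputs).

```python
import numpy as np


def masked_reconstruction_loss(pred, target, mask):
    """MSE averaged over masked patches only (the usual MAE convention).

    pred, target: arrays of shape (num_patches, patch_dim)
    mask: array of shape (num_patches,), 1 = patch was masked, 0 = visible
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / max(mask.sum(), 1))


def cross_view_alignment_loss(embeddings):
    """Mean pairwise cosine distance between the view embeddings of one study.

    embeddings: array of shape (num_views, dim).
    ASSUMPTION: cosine distance stands in for the paper's alignment term.
    """
    n = len(embeddings)
    if n < 2:
        return 0.0  # a single-view study has nothing to align
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(1.0 - off_diag.mean())


def study_loss(preds, targets, masks, embeddings, align_weight=1.0):
    """Per-view reconstruction averaged over views, plus weighted alignment.

    ASSUMPTION: a simple weighted sum; `align_weight` is hypothetical.
    """
    recon = np.mean([
        masked_reconstruction_loss(p, t, m)
        for p, t, m in zip(preds, targets, masks)
    ])
    return float(recon + align_weight * cross_view_alignment_loss(embeddings))
```

In this sketch, the reconstruction term pushes each view's representation to keep local detail, while the alignment term pulls the embeddings of views from the same study together. A text objective such as the one in MVMAE-V2T would enter as a further additive term during pretraining only.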

Paper Structure

This paper contains 31 sections, 7 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Framework overview. (a) Pretraining and (b) Finetuning stages of the proposed MVMAE and MVMAE-V2T frameworks. Each study includes multiple views processed jointly through masked reconstruction and cross-view alignment losses. MVMAE-V2T additionally incorporates a vision-to-text objective.
  • Figure 2: Study-level instance in MIMIC-CXR. The first row shows three different views in the same study, with their corresponding report and final labels.
  • Figure 3: Per-label AUROC comparison across datasets (MIMIC-CXR, CheXpert Plus, PadChest). Left: Overall performance, with model selection based on the best joint dataset. Right: Model selection done to optimize per-label results.
  • Figure 4: Label efficiency under finetuning. Performance curves of the Combined dataset macro-average AUROC over 14 pathology labels, as a function of the number of labeled studies used for finetuning.
  • Figure 5: Average AUROC performance gain of MVMAE on Independent over three seeds.
  • ...and 6 more figures