Promoting cross-modal representations to improve multimodal foundation models for physiological signals

Ching Fang, Christopher Sandino, Behrooz Mahasseni, Juri Minxha, Hadi Pouransari, Erdrin Azemi, Ali Moin, Ellen Zippi

TL;DR

This work pretrains a multimodal model with a masked autoencoding objective and shows that the learned representations can be linearly probed for a diverse set of downstream tasks. It also hypothesizes that cross-modal reconstruction objectives are important for successful multimodal training.

Abstract

Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration and it is unclear which pretraining strategies are most effective given the diversity of physiological signals. This is partly due to challenges in multimodal health data: obtaining data across many patients is difficult and costly, there is a lot of inter-subject variability, and modalities are often heterogeneously informative across downstream tasks. Here, we explore these challenges in the PhysioNet 2018 dataset. We use a masked autoencoding objective to pretrain a multimodal model. We show that the model learns representations that can be linearly probed for a diverse set of downstream tasks. We hypothesize that cross-modal reconstruction objectives are important for successful multimodal training, as they encourage the model to integrate information across modalities. We demonstrate that modality dropout in the input space improves performance across downstream tasks. We also find that late-fusion models pretrained with contrastive learning objectives are less effective across multiple tasks. Finally, we analyze the model's representations, showing that attention weights become more cross-modal and temporally aligned with our pretraining strategy. The learned embeddings also become more distributed in terms of the modalities encoded by each unit. Overall, our work demonstrates the utility of multimodal foundation models with health data, even across diverse physiological data sources. We further argue that explicit methods for inducing cross-modality may enhance multimodal pretraining strategies.
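
To make the idea of modality dropout in the input space concrete, the following is a minimal sketch and not the authors' implementation. It assumes four modalities (the count given in the Figure 2 caption), a 70% token masking probability (the value shown in Figure 4), and that at most one whole modality is dropped per sample. The function name `mask_with_modality_drop`, the `drop_prob` parameter, and the single-modality drop rule are hypothetical choices made for illustration.

```python
import numpy as np

def mask_with_modality_drop(num_modalities, tokens_per_modality,
                            mask_prob, drop_prob, rng):
    """Return a boolean array of shape (modalities, tokens).

    True marks a token that is hidden from the encoder and must be
    reconstructed. All names and the drop rule are illustrative assumptions.
    """
    # Standard masked-autoencoder step: hide each token independently.
    masked = rng.random((num_modalities, tokens_per_modality)) < mask_prob
    # Modality drop (assumed rule): with probability drop_prob, hide one
    # entire modality, so it can only be reconstructed from the others.
    if rng.random() < drop_prob:
        dropped = rng.integers(num_modalities)
        masked[dropped, :] = True
    return masked

rng = np.random.default_rng(0)
mask = mask_with_modality_drop(num_modalities=4, tokens_per_modality=10,
                               mask_prob=0.7, drop_prob=1.0, rng=rng)
print(mask.shape)             # one row per modality, one column per token
print(mask.all(axis=1).any()) # whether some modality is fully hidden
```

When a whole modality is hidden, the reconstruction loss for that modality can only be reduced by using information from the remaining modalities, which is the cross-modal pressure the abstract hypothesizes to be important.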

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: A. A 30-second sample from the training dataset. B. Data is split by patient identity for each part of the training procedure. The PhysioNet 2018 dataset consists of unlabeled data from 989 patients and labeled data from 996 patients, where each patient contributes 7.7 hours of data on average. The data for pretraining consists of all patients in the unlabeled dataset and 657 patients from the labeled dataset. The data for training and finetuning is drawn from the patients of the labeled dataset that were also used for pretraining. The data for the validation and test are drawn from the remaining patients of the labeled dataset not used for either pretraining or training. C. Diagram of the main pretraining strategy we use: multimodal masked autoencoding with modality drop in the input space. Tokenizers are modality-specific.
  • Figure 2: Measures of modality fusion across model representations. A. Attention rollout from tokens in the embeddings to tokens in the input. Here, the model is trained from scratch on sleep staging. Values are capped at 0.03 for comparisons with (B-C). B. As in (A), but for the model pretrained with MAE. C. As in (A), but for the model pretrained with MAE and input modality drop. D. Relative source variance (RSV) of units across layers of the model in (A) to each of the four modalities. 95% confidence intervals shown, over 512 units in each embedding vector. E-F. As in (D), but for the models in (B) and (C), respectively.
  • Figure 3: Hyperparameter selection in MAE models. A. Validation set accuracy in the sleep staging task, with full finetuning. Here, we show models pretrained with MultiMAE only. The x-axis shows the masking probability. B. As in (A), but for the model pretrained with MultiMAE and input modality drop.
  • Figure 4: Reconstruction performance of MultiMAE model with 70% masking. A. A random sample from the training data, with target signals in blue and reconstructed signals in orange. Plot is truncated at 20 seconds for visualization purposes. B. As in (A), but for another random sample.
  • Figure 5: Contrastive learning architecture.
  • ...and 1 more figure
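
The attention rollout named in the Figure 2 caption can be sketched as follows. This is the commonly used formulation of the technique, which may differ in detail from what the authors computed. It assumes the attention of each layer has already been averaged over heads into one row-stochastic matrix, and that the residual connection is modeled by mixing each matrix equally with the identity. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def attention_rollout(attentions):
    """Compose per-layer attention into input-to-output token attribution.

    attentions: list of (tokens, tokens) head-averaged attention matrices,
    first layer first. Returns a (tokens, tokens) matrix whose rows index
    output tokens and whose columns index input tokens.
    """
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in attentions:
        # Model the residual connection, then renormalize each row to sum to 1.
        mixed = 0.5 * attn + 0.5 * np.eye(n)
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        # Compose this layer with the rollout of all earlier layers.
        rollout = mixed @ rollout
    return rollout

# Toy example: three layers of random softmax attention over six tokens.
rng = np.random.default_rng(0)
layers = []
for _ in range(3):
    logits = rng.normal(size=(6, 6))
    weights = np.exp(logits)
    layers.append(weights / weights.sum(axis=1, keepdims=True))

rollout = attention_rollout(layers)
# Cap values for display, mirroring the 0.03 cap mentioned in the caption.
capped = np.minimum(rollout, 0.03)
```

If the tokens are ordered by modality, off-diagonal blocks of the rollout matrix indicate how much an output token of one modality draws on input tokens of another, which is the kind of cross-modal structure the caption describes comparing across models.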