COMPRER: A Multimodal Multi-Objective Pretraining Framework for Enhanced Medical Image Representation

Guy Lutsker, Hagai Rossman, Nastya Godiva, Eran Segal

TL;DR

This work presents COMPRER, a novel multi-modal, multi-objective pretraining framework which enhances medical-image representation, diagnostic inferences, and prognosis of diseases, and introduces a novel evaluation metric, providing deeper understanding of the effectiveness of the latent space pairing.

Abstract

Substantial advances in multi-modal Artificial Intelligence (AI) facilitate the combination of diverse medical modalities to achieve holistic health assessments. We present COMPRER, a novel multi-modal, multi-objective pretraining framework which enhances medical-image representation, diagnostic inferences, and prognosis of diseases. COMPRER employs a multi-objective training framework, where each objective introduces distinct knowledge to the model. This includes a multimodal loss that consolidates information across different imaging modalities; a temporal loss that imparts the ability to discern patterns over time; a medical-measure prediction objective that adds appropriate medical insights; and a reconstruction loss that ensures the integrity of image structure within the latent space. Despite the concern that multiple objectives could weaken task performance, our findings show that this combination actually boosts outcomes on certain tasks. Here, we apply this framework to both fundus images and carotid ultrasound, and validate its downstream-task capabilities by predicting both current and future cardiovascular conditions. COMPRER achieved higher Area Under the Curve (AUC) scores in evaluating medical conditions compared to existing models on held-out data. On the out-of-distribution (OOD) UK-Biobank dataset, COMPRER maintains favorable performance over well-established models with more parameters, even though these models were trained on $75\times$ more data than COMPRER. In addition, to better assess our model's performance in contrastive learning, we introduce a novel evaluation metric, providing deeper understanding of the effectiveness of the latent space pairing.
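
To make the multi-objective idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the multimodal term is a CLIP-style symmetric InfoNCE loss over paired fundus and ultrasound embeddings, and that the objectives are combined as a weighted sum. The function names, the temperature value, and the equal default weights are all illustrative assumptions not stated in the abstract.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets.

    Assumption: row i of both lists comes from the same participant.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

def symmetric_contrastive(fundus, ultrasound, temperature=0.1):
    """Average the fundus-to-ultrasound and ultrasound-to-fundus directions."""
    return 0.5 * (info_nce(fundus, ultrasound, temperature)
                  + info_nce(ultrasound, fundus, temperature))

def combine_objectives(losses, weights=None):
    """Weighted sum of named loss terms; equal weights are a hypothetical default."""
    weights = weights or {name: 1.0 for name in losses}
    return sum(weights[name] * value for name, value in losses.items())
```

With this sketch, correctly paired embeddings yield a much smaller contrastive term than mismatched ones, and `combine_objectives` would receive one entry per objective (for example multimodal, temporal, measure prediction, and reconstruction).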

Paper Structure (16 sections, 3 equations, 5 figures, 1 algorithm)

Figures (5)

  • Figure 1: Schematic representation of COMPRER, our Contrastive Multi-objective Pretraining approach for Multi-modal Representation. We utilize ViT-Base encoders equipped with DINOV2 pre-trained weights for processing each imaging modality, accompanied by a linear projection head. Our method is defined by a combination of multiple loss objectives: (1) an ID-centric multi-modal contrastive loss that bridges features between fundus images and carotid ultrasound images; (2) a patient visit-based contrastive loss that discerns temporal discrepancies across repeat visits for each patient; (3) a contrastive scheme for the bilateral fundus images to ensure coupling of right and left eye data per patient; (4) a decoding objective to restore original images from condensed latent representations; and (5) a predictive mechanism to estimate general medical measures directly from modality-specific embeddings.
  • Figure 2: Interval Validation. 2.A (left panel): Prediction accuracy of various medical measures over training. This panel shows the test $R^2$ values achieved while training predictive models for four medical measures: fractal dimension, age, vessel density, and artery average width. The x-axis is the number of training steps (in tens of thousands), and the y-axis is the $R^2$ value observed on held-out test data. 2.B (right panel): Evolution of Top-K test accuracy in contrastive learning for multimodal image matching. This panel shows the test accuracy of the contrastive learning model for two values of K, 25 and 100, labeled Top-25 (orange line) and Top-100 (blue line), respectively. The x-axis is again the number of training steps, and the y-axis is the Top-K test accuracy. The Top-100 accuracy increases as training proceeds, reaching a peak of 0.65. Similarly, the Top-25 accuracy ends around 0.35.
  • Figure 3: Comparative Analysis of Top-K Contrastive Metric on COMPRER and MMCL. Green, purple, and red show the Top-100, Top-25, and Top-5 contrastive metric, respectively. The dotted line is the MMCL model, which was trained solely on the multi-modal contrastive loss $\mathcal{L}_{\text{contr\_rc}}$, and the solid line is COMPRER, which was trained on multiple objectives, including the multi-modal contrastive loss. 3.A shows the multiplicative top-K metric, and 3.B shows the top-K metric. COMPRER consistently outperforms MMCL.
  • Figure 4: Comparative Analysis of Model Performance in Predicting Cardiovascular Conditions from Fundus Images. 4.A illustrates the AUC results for various models fully finetuned to predict current cardiovascular conditions from fundus images taken at the participants' first visit. The models compared are our proposed COMPRER, MMCL, DINOV2-Base, and a Retina Foundation Model. The boxplots represent the distribution of AUC scores achieved on the test set after selecting the top 5 hyperparameter configurations for each model based on validation performance. 4.B depicts the AUC results for the same set of models, but focuses on predicting the future onset of cardiovascular conditions. This analysis only includes participants who were healthy at baseline, with the goal of determining whether the models can predict the development of cardiovascular conditions at a follow-up visit. While the AUC scores on their own are not high, COMPRER consistently outperforms the other models. MMCL also shows some gains over DINOV2 and the Retina Foundation Model, which suggests that multi-modal contrastive learning on its own can increase downstream performance.
  • Figure 5: Comparative Analysis of COMPRER and Retina Foundation Model Performance in Predicting Diseases from Fundus Image Representations on an External Dataset (UKBB). Predictions are made from frozen image embeddings using logistic regression. COMPRER outperforms the Retina Foundation Model in the prediction of ischaemic stroke, as evidenced by a higher AUC value. Conversely, the Retina Foundation Model exhibits superior performance in predicting chronic ischaemic heart disease, although COMPRER also shows a test AUC above 0.6. For stroke, heart failure, and acute ischaemic heart disease, COMPRER slightly outperforms the Retina Foundation Model, though the margins are not significant. It is also notable that the Retina Foundation Model was trained using $75\times$ more fundus images, and with a $3.5\times$ larger model.
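
The Top-K accuracy reported in Figures 2.B and 3.B can be read as a retrieval score. As a hedged sketch, assume each query embedding (say, a fundus image) is compared by cosine similarity against all candidate embeddings (say, carotid ultrasound images), and a query counts as a hit when its true pair ranks among the K most similar candidates. The function names and the tie-handling rule below are illustrative assumptions. The multiplicative variant shown in Figure 3.A is not defined in this section, so it is not sketched here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def top_k_accuracy(queries, candidates, k):
    """Fraction of queries whose true pair is among the k most similar candidates.

    Assumption: queries[i] and candidates[i] belong to the same participant.
    """
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, c) for c in candidates]
        # Rank of the true pair = number of candidates strictly more similar.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(queries)

# Toy example: the first two pairs are well aligned, the third is not.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0]]
top1 = top_k_accuracy(queries, candidates, k=1)
top3 = top_k_accuracy(queries, candidates, k=3)
```

Under this reading, a random embedding gives an expected Top-K accuracy of roughly K divided by the number of candidates, which is a useful baseline when interpreting values such as 0.65 for Top-100.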