VisTA: Vision-Text Alignment Model with Contrastive Learning using Multimodal Data for Evidence-Driven, Reliable, and Explainable Alzheimer's Disease Diagnosis

Duy-Cat Can, Linh D. Dang, Quang-Huy Tang, Dang Minh Ly, Huong Ha, Guillaume Blanc, Oliver Y. Chén, Binh T. Nguyen

TL;DR

VisTA presents a vision-text alignment framework for Alzheimer's disease diagnosis that integrates radiology images with expert-verified abnormalities and descriptions through contrastive learning. Built on BiomedCLIP and refined with a small, curated MINDset dataset, VisTA outputs abnormality type, similarity-based evidence, explanations, and final AD predictions in a modular, clinically aligned pipeline. It achieves strong abnormality retrieval (74% accuracy, AUC 0.87) and dementia prediction (88% accuracy, AUC 0.82) on MINDset, surpassing baselines trained on millions of images, while generating explanations that align with human expert judgments. Limitations include the small size of MINDset and challenges in differentiating AD from other dementias, with future work focused on expanding data, incorporating additional modalities, and validating deployment in clinical settings.

Abstract

Objective: Assessing Alzheimer's disease (AD) using high-dimensional radiology images is clinically important but challenging. Although Artificial Intelligence (AI) has advanced AD diagnosis, it remains unclear how to design AI models that combine predictive accuracy with explainability. Here, we propose VisTA, a multimodal language-vision model assisted by contrastive learning, to optimize disease prediction and provide evidence-based, interpretable explanations for clinical decision-making.

Methods: We developed VisTA (Vision-Text Alignment Model) for AD diagnosis. Architecturally, we built VisTA from BiomedCLIP and fine-tuned it using contrastive learning to align images with verified abnormalities and their descriptions. To train VisTA, we used a constructed reference dataset containing images, abnormality types, and descriptions verified by medical experts. VisTA produces four outputs: predicted abnormality type, similarity to reference cases, evidence-driven explanation, and final AD diagnosis. To illustrate VisTA's efficacy, we reported accuracy metrics for abnormality retrieval and dementia prediction. To demonstrate VisTA's explainability, we compared its explanations with human experts' explanations.

Results: Whereas the baseline was pretrained on 15 million images, VisTA used only 170 samples for fine-tuning and still obtained significant improvement in abnormality retrieval and dementia prediction. For abnormality retrieval, VisTA reached 74% accuracy and an AUC of 0.87 (versus 26% and 0.74, respectively, for the baseline models). For dementia prediction, VisTA achieved 88% accuracy and an AUC of 0.82 (versus 30% and 0.57, respectively, for the baseline models). The generated explanations agreed strongly with human experts' explanations and provided insights into the diagnostic process. Taken together, VisTA optimizes prediction, clinical reasoning, and explanation.
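The abstract states that VisTA is fine-tuned with contrastive learning to align images with text, but it does not give the loss. As an illustration only, the sketch below shows a CLIP-style symmetric contrastive (InfoNCE) objective, which is a common choice for this kind of alignment. The function names, the temperature value of 0.07, and the toy embeddings are assumptions and are not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(image_embs, text_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropies."""
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy 2-D embeddings (hypothetical): pair k of images matches pair k of texts.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
# Same images, but the texts are swapped so every pair is mismatched.
swapped_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each image embedding is closest to its own text embedding and large when the pairs are swapped, which is the behaviour that fine-tuning on matched image-description pairs is meant to exploit.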

Paper Structure

This paper contains 63 sections, 12 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of Datasets for Abnormality Retrieval and Alzheimer's Disease Prediction. Summary of datasets used in the study, including the constructed reference dataset and publicly available Alzheimer's disease datasets. (a) The reference dataset (MINDset) contains medical radiology images paired with descriptions and abnormality types (e.g., normal, medial temporal lobe atrophy, white matter hyperintensities, and other atrophy), verified by medical experts. (b) The public dataset includes MRI scans and diagnostic labels for Alzheimer's disease prediction. Key statistics, such as the number of images, types of abnormalities, and verification process, are also shown.
  • Figure 2: Model Architecture and Workflow of the Proposed Method. (a) The schematic representation of the VisTA architecture, showing its pre-trained components and the fine-tuning process using contrastive learning to align image and text embeddings. The architecture incorporates multimodal learning to capture relationships between radiology images and their descriptions and abnormality types. (b) The workflow of the VisTA framework, with its end-to-end pipeline: (1) input radiology image, (2) retrieval of reference abnormalities with similarity scoring, (3) generation of descriptive explanations, and (4) prediction of Alzheimer's disease based on the retrieved evidence. The pipeline highlights the modularity and explainability of VisTA: it ensures reliability and alignment with real-world clinical diagnostic workflows.
  • Figure 3: Confusion Matrices for Abnormality Type Classification and Disease Prediction. Confusion matrices illustrate the performance of pre-trained BiomedCLIP and VisTA models across abnormality type classification (Panels a-d), dementia prediction (Panels e-h), and AD prediction (Panels i-l). Panels a-d depict the classification accuracy for identifying specific abnormality types using the pre-trained BiomedCLIP model and three versions of VisTA. Panels e-h and i-l show the performance for dementia and AD (binary) prediction. The confusion matrices highlight the improvements in classification accuracy and reduction of misclassifications achieved through contrastive learning on MINDset medical references.
  • Figure 4: Case Study Examples of VisTA for Delivering Evidence-Driven Explainable Alzheimer's Disease Diagnosis. VisTA takes radiology images as input and outputs explainable, evidence-based predictions that align with real-world clinical reasoning processes.
  • Figure 5: The Embedding Space of Image and Text Representations. The panels show the embedding space from the final encoder used in the diagnosis step, integrating both image embeddings and textual embeddings of abnormality types. (a) Embeddings from the pre-trained BiomedCLIP model, showing initial alignment between radiology image representations and fixed abnormality type text embeddings. Clustering of abnormalities is present but lacks strong separation. (b) VisTA embeddings, showing a clear separation between abnormality types, with improved multimodal alignment between image and text representations. The results highlight the effectiveness of contrastive learning in enhancing embedding quality, ensuring better differentiation between abnormality categories.
  • ...and 1 more figure
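The retrieval step in the Figure 2 workflow (matching an input image against reference abnormalities with similarity scoring) can be sketched as a nearest-neighbour search over embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ReferenceCase` type, the `retrieve` function, the 3-D embeddings, and the placeholder descriptions are all hypothetical, and cosine similarity is assumed as the score. Only the abnormality type names come from the Figure 1 caption.

```python
import math
from typing import NamedTuple


class ReferenceCase(NamedTuple):
    """One entry of a reference set (hypothetical structure)."""
    embedding: list
    abnormality_type: str
    description: str


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, references, top_k=2):
    """Return the top_k (similarity, reference) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_embedding, ref.embedding), ref)
        for ref in references
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy reference set: embeddings and descriptions are made up for illustration.
references = [
    ReferenceCase([0.9, 0.1, 0.0], "normal",
                  "placeholder description A"),
    ReferenceCase([0.1, 0.9, 0.1], "medial temporal lobe atrophy",
                  "placeholder description B"),
    ReferenceCase([0.0, 0.2, 0.9], "white matter hyperintensities",
                  "placeholder description C"),
]

# Hypothetical embedding of an input radiology image.
query = [0.2, 0.8, 0.1]
hits = retrieve(query, references, top_k=2)

for score, ref in hits:
    print(f"{ref.abnormality_type}: similarity {score:.3f}")
```

Returning the similarity score together with the matched reference keeps the evidence attached to each prediction, which is the property the workflow caption emphasises. How the retrieved evidence is turned into an explanation and a final AD prediction is not specified in this section, so it is left out of the sketch.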