Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

Fangqi Cheng, Surajit Ray, Xiaochen Yang

TL;DR

This work presents a data-efficient fine-tuning pipeline for 3D vision-language models to diagnose Alzheimer's disease from MRI, by converting subject metadata into synthetic reports, adding an auxiliary MMSE-predicting token, and applying prompt tuning with cross-attention to align image-text representations. The approach achieves state-of-the-art performance on the ADNI dataset with only 1,504 training MRIs and demonstrates strong zero-shot generalization to OASIS-2 and AIBL, outperforming several larger, fully trained baselines. Key contributions include (i) a metadata-to-text augmentation strategy, (ii) an MMSE supervision signal to inject clinical knowledge, and (iii) a PEFT-based, cross-attentive 3D Med-VLM framework that remains data-efficient. Overall, the method advances practical clinical deployment by enabling accurate AD diagnosis from 3D MRI with substantially fewer labeled examples and improved interpretability via biomarker-focused attention.

Abstract

Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on ADNI with only 1,504 training MRIs, outperforming methods trained on 27,161 MRIs, and shows strong zero-shot generalization on OASIS-2 and AIBL. Code is available at https://github.com/CFQ666312/DEFT-VLM-AD.
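
To make the first innovation concrete, the following is a minimal sketch of how structured metadata might be turned into a synthetic report. It is not the authors' template: the field names (`age`, `sex`, `entorhinal_volume`, and so on), the sentence wording, and the mm^3 unit are assumptions chosen for illustration, with the biomarker names borrowed from the Figure 3 caption.

```python
from typing import Mapping, Union

Value = Union[str, int, float]

# Hypothetical field names and sentence templates (assumptions for this
# sketch; the template used in the paper may differ).
TEMPLATES = {
    "age": "The subject is {value} years old.",
    "sex": "The subject is {value}.",
    "entorhinal_volume": "Entorhinal volume is {value} mm^3.",
    "ventricular_size": "Ventricular size is {value} mm^3.",
    "whole_brain_volume": "Whole brain volume is {value} mm^3.",
}


def metadata_to_report(meta: Mapping[str, Value]) -> str:
    """Render the known metadata fields as sentences, in template order.

    Fields that are missing or None are skipped, so subjects with
    incomplete records still yield a (shorter) report.
    """
    sentences = []
    for key, template in TEMPLATES.items():
        value = meta.get(key)
        if value is None:
            continue
        sentences.append(template.format(value=value))
    return " ".join(sentences)


if __name__ == "__main__":
    print(metadata_to_report({"age": 74, "sex": "female", "entorhinal_volume": 3200}))
```

A report built this way can then be tokenized and paired with the subject's MRI as the text side of an image-text pair. Keeping one sentence per field also makes it easy to query the model with a single biomarker sentence at a time.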

Paper Structure

This paper contains 23 sections, 12 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of the fine-tuning pipeline. The model is trained for image-text alignment and MMSE prediction. The upper part shows report processing with learnable prompts, while the lower part illustrates MRI processing with visual prompts and an auxiliary token. Cross-attention refines image-text feature alignment.
  • Figure 2: Heatmap of integrated gradients showing the importance of each biomarker for different disease progression stages (darker color indicates higher importance).
  • Figure 3: Visualization of Grad-ECLIP attention heatmaps generated using different textual inputs, illustrating the individual contributions of Entorhinal Volume, Ventricular Size, and Whole Brain Volume to image-text alignment. The warmer parts of the heatmaps indicate regions where the model places greater importance when matching the image with the given biomarker text. Each row corresponds to MRI scans of the same subject at different clinical stages: NC, MCI, and AD.
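
The Figure 1 caption names two training signals: image-text alignment and MMSE prediction. The sketch below shows one plausible way to combine them into a single objective. It is an assumption-laden illustration, not the paper's loss: the alignment term is a CLIP-style symmetric contrastive loss, the MMSE term is a mean squared error on scores scaled by the 0 to 30 MMSE range, and `temperature` and `mmse_weight` are hypothetical hyperparameters.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def joint_loss(
    img_emb: Sequence[Vector],
    txt_emb: Sequence[Vector],
    mmse_pred: Sequence[float],
    mmse_true: Sequence[float],
    temperature: float = 0.07,  # assumed value, not taken from the paper
    mmse_weight: float = 1.0,   # assumed value, not taken from the paper
) -> float:
    """Symmetric contrastive alignment loss plus a weighted MMSE regression loss."""
    img = [_normalize(v) for v in img_emb]
    txt = [_normalize(v) for v in txt_emb]
    # Cosine similarity between every image and every report in the batch.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    align = 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))
    # MMSE scores lie in [0, 30]; scale errors so the term stays near [0, 1].
    mmse = sum(((p - y) / 30.0) ** 2 for p, y in zip(mmse_pred, mmse_true)) / len(mmse_true)
    return align + mmse_weight * mmse
```

In this sketch, `mmse_pred` stands for the output of a regression head attached to the auxiliary token. Matching image-report pairs sit on the diagonal of the similarity matrix, so the loss falls as each MRI embedding moves toward its own report and away from the others in the batch.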