Specialized curricula for training vision-language models in retinal image analysis

Robbie Holland, Thomas R. P. Taylor, Christopher Holmes, Sophie Riedl, Julia Mai, Maria Patsiamanidi, Dimitra Mitsopoulou, Paul Hager, Philip Müller, Hendrik P. N. Scholl, Hrvoje Bogunović, Ursula Schmidt-Erfurth, Daniel Rueckert, Sobha Sivaprasad, Andrew J. Lotery, Martin J. Menten

TL;DR

This work demonstrates that generic vision-language models underperform on specialized ophthalmology tasks, especially staging of age-related macular degeneration (AMD) and referral decisions. Using a curriculum-based training pipeline that combines tabular biomarker data with high-quality specialist reports, the authors train RetinaVLM-Specialist, a specialist VLM that produces accurate, biomarker-grounded imaging reports and comes close to junior ophthalmologists on disease staging. In extensive evaluations, RetinaVLM-Specialist outperformed foundation medical VLMs and non-specialist baselines on AMD-related tasks, and even surpassed opticians in patient screening while approaching junior clinicians in accuracy. The study advocates targeted, high-quality data curation and curriculum-driven instruction as a blueprint for deploying clinically useful medical VLMs, and it notes limitations such as domain shifts, hallucinations, and the need for broader disease coverage and 3D imaging support.

Abstract

Clinicians spend a significant amount of time reviewing medical images and transcribing their findings regarding patient diagnosis, referral and treatment in text form. Vision-language models (VLMs), which automatically interpret images and summarize their findings as text, have enormous potential to alleviate clinical workloads and increase patient access to high-quality medical care. While foundational models have stirred considerable interest in the medical community, it is unclear whether their general capabilities translate to real-world clinical utility. In this work, we demonstrate that OpenAI's ChatGPT-4o model and two foundation VLMs designed for medical use markedly underperform practicing ophthalmologists on specialist tasks crucial to the care of patients with age-related macular degeneration (AMD). To address this, we first identified the essential capabilities required for image-based clinical decision-making, and then developed a curriculum to selectively train VLMs in these skills. The resulting model, RetinaVLM, can be instructed to write reports that significantly outperform those written by leading foundation medical VLMs and ChatGPT-4o in disease staging (F1 score of 0.63 vs. 0.33) and patient referral (0.67 vs. 0.50), and approaches the diagnostic performance of junior ophthalmologists (who achieve 0.77 and 0.78 on the respective tasks). Furthermore, in a single-blind reader study, two senior ophthalmologists with up to 32 years of experience found RetinaVLM's reports to be substantially more accurate than those by ChatGPT-4o (64.3% vs. 14.3%). These results reinforce that our curriculum-based approach provides a blueprint towards specializing foundation medical VLMs for real-world clinical tasks.

Paper Structure (60 sections, 1 equation, 20 figures, 2 tables)

Figures (20)

  • Figure 1: We introduce RetinaVLM, a specialist medical generative vision-language model (VLM). (a) Using a curriculum-based approach, we trained RetinaVLM in specialist medical skills that medical foundation VLMs are currently lacking. (b) RetinaVLM is able to process retinal optical coherence tomography (OCT) images and flexibly respond to text-based queries. (c) Its abilities entail the analysis of imaging biomarkers of age-related macular degeneration (AMD), disease staging, and the referral for treatment.
  • Figure 2: We curated a two-part curriculum to specialize medical VLMs for clinical use. (a and b) Based on a retrospectively collected OCT imaging dataset, we created a large number of tabular reports as well as a small number of comprehensive textual reports. (c and d) We then used an independent LLM to automatically generate visual question-answers based on these reports. (e and f) This yielded two VQA datasets, the first on basic imaging biomarkers of AMD and the second covering more advanced clinical skills. (g and h) Finally, we trained two specialist medical generative VLMs, RetinaVLM-Base and RetinaVLM-Specialist, using either the first or both VQA datasets.
  • Figure 3: (a) Comparison of the ability of four VLMs to write reports on retinal OCT images and derive the AMD stage. (b) Overall staging accuracy for each model was calculated using micro F1 scores with 95% CI, with tests of statistical significance calculated using McNemar's test. (c) Confusion matrices comparing the senior ophthalmologists' assessments (rows) with the image-based clinical decision maker's predictions (columns). (d) Qualitative comparison of reports written by human ophthalmologists and RetinaVLM-Specialist, with text markings highlighting findings regarding biomarker observations and disease stage.
  • Figure 4: (a) Summary statistics of the quality of image reports written by ChatGPT-4o, RetinaVLM-Specialist and junior ophthalmologists, broken down by correctness, completeness and conciseness. Reports were scored on each of the three criteria by senior ophthalmologists using a five-point Likert scale. (b) Representative reports with ratings by one of the senior ophthalmologists. As ChatGPT-4o tended to write excessively long reports, despite being prompted to shorten them, we display the passages that the senior ophthalmologists selected as most important to their given rating. For verbose versions and additional sample reports by ChatGPT-4o see Supplementary Figure \ref{fig:chatgpt_output}.
  • Figure 5: (a) Evaluation of the ability of four VLMs to assess the need for patient referral for treatment of wet AMD. (b) Overall referral accuracy was calculated using F1 score for urgent referral with a 95% CI. Tests of statistical significance were carried out using McNemar's test. The performance of individual ophthalmologists is shown by two white points. (c) Confusion matrices comparing the senior ophthalmologists' assessments (rows) with the image-based clinical decision maker's referral assessments (columns). (d) Image reports written by the non-specialist optician who originally referred the patient, compared with reports of the same patient written by RetinaVLM-Specialist.
  • ...and 15 more figures
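
The captions for Figures 3 and 5 name two evaluation tools: micro F1 scores and McNemar's test for comparing paired predictions. The following is a minimal sketch of both, not the authors' evaluation code. The stage labels, the toy predictions, and the function names `micro_f1` and `mcnemar_exact` are hypothetical, and the exact binomial form of McNemar's test is an assumption, since the captions do not say which variant was used or how the 95% CIs were obtained.

```python
from math import comb


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models scored on the same cases."""
    # Only discordant cases (exactly one model correct) carry information.
    a_only = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    b_only = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0
    # Under the null, each discordant case favours either model with p = 0.5.
    tail = sum(comb(n, k) for k in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: reference stages and two models' predicted stages.
reference = ["healthy", "early", "late_wet", "late_dry", "early", "late_wet"]
model_a = ["healthy", "early", "late_wet", "late_dry", "healthy", "late_wet"]
model_b = ["healthy", "late_dry", "early", "late_dry", "healthy", "early"]

print(micro_f1(reference, model_a))                # 5 of 6 cases correct
print(mcnemar_exact(reference, model_a, model_b))  # 0.25
```

When every case receives exactly one predicted stage, each error counts once as a false positive and once as a false negative, so micro F1 reduces to plain accuracy. McNemar's test ignores cases where both models are right or both are wrong, which is why three discordant cases out of six cannot produce a small p-value in this toy example.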