Specialized curricula for training vision-language models in retinal image analysis
Robbie Holland, Thomas R. P. Taylor, Christopher Holmes, Sophie Riedl, Julia Mai, Maria Patsiamanidi, Dimitra Mitsopoulou, Paul Hager, Philip Müller, Hendrik P. N. Scholl, Hrvoje Bogunović, Ursula Schmidt-Erfurth, Daniel Rueckert, Sobha Sivaprasad, Andrew J. Lotery, Martin J. Menten
TL;DR
This work demonstrates that generic vision-language models underperform on specialized ophthalmology tasks, especially AMD staging and referral decisions. The authors engineer a curriculum-based training pipeline that combines tabular biomarker data with high-quality specialist reports. The resulting model, RetinaVLM-Specialist, produces accurate, biomarker-grounded imaging reports and comes close to junior ophthalmologists on disease staging. In extensive evaluations, RetinaVLM-Specialist outperformed foundation medical VLMs and non-specialist baselines on AMD-related tasks, and surpassed opticians in patient screening. The study advocates targeted, high-quality data curation and curriculum-driven instruction as a blueprint for deploying clinically useful medical VLMs, and it highlights limitations such as domain shifts, hallucinations, and the need for broader disease coverage and 3D imaging support.
Abstract
Clinicians spend a significant amount of time reviewing medical images and transcribing their findings regarding patient diagnosis, referral and treatment in text form. Vision-language models (VLMs), which automatically interpret images and summarize their findings as text, have enormous potential to alleviate clinical workloads and increase patient access to high-quality medical care. While foundational models have stirred considerable interest in the medical community, it is unclear whether their general capabilities translate to real-world clinical utility. In this work, we demonstrate that OpenAI's ChatGPT-4o model and two foundation VLMs designed for medical use markedly underperform practicing ophthalmologists on specialist tasks crucial to the care of patients with age-related macular degeneration (AMD). To address this, we first identified the essential capabilities required for image-based clinical decision-making, and then developed a curriculum to selectively train VLMs in these skills. The resulting model, RetinaVLM, can be instructed to write reports that significantly outperform those written by leading foundation medical VLMs and ChatGPT-4o in disease staging (F1 score of 0.63 vs. 0.33) and patient referral (0.67 vs. 0.50), and approaches the diagnostic performance of junior ophthalmologists (who achieve 0.77 and 0.78 on the respective tasks). Furthermore, in a single-blind reader study, two senior ophthalmologists with up to 32 years of experience found RetinaVLM's reports to be substantially more accurate than those by ChatGPT-4o (64.3% vs. 14.3%). These results reinforce that our curriculum-based approach provides a blueprint for specializing foundation medical VLMs for real-world clinical tasks.
