Table of Contents
Fetching ...

VOLMO: Versatile and Open Large Models for Ophthalmology

Zhenyue Qin, Younjoon Chung, Elijah Lee, Wanyue Feng, Xuguang Ai, Serina Applebaum, Minjie Zou, Yang Liu, Pan Xiao, Mac Singer, Amisha Dave, Aidan Gilson, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ron Adelman, Luciano V. Del Priore, Qingyu Chen

Abstract

Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.

VOLMO: Versatile and Open Large Models for Ophthalmology

Abstract

Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.
Paper Structure (38 sections, 9 equations, 12 figures, 9 tables)

This paper contains 38 sections, 9 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Overview of VOLMO development pipeline and multi-round clinical reasoning. The framework consists of three components: (Left) Dataset Extraction from publicly available sources including ophthalmological literature (PubMed Central), benchmarking datasets (Hugging Face, GitHub, etc.), and patient clinical profiles with diagnosis reports. (Center) Model Training through a three-stage progressive framework: Stage 1 - Ophthalmology knowledge pretraining using 86,965 image-text pairs to inject foundational domain knowledge; Stage 2 - Domain task fine-tuning on 26,929 disease-labeled instances across 12 conditions and signs for disease screening and staging; Stage 3 - Reasoning and synthesis training on 913 comprehensive case reports to enable clinical assessment generation. Snowflake icons indicate frozen components during training. (Right) Multi-round Conversation Example demonstrating VOLMO's clinical reasoning workflow, where the model sequentially generates differential diagnoses, determines the most likely diagnosis, formulates clinical assessments and treatment plans, recommends specific treatments, and provides follow-up care guidance based on patient clinical profiles and multi-modal ophthalmological imaging.
  • Figure 2: Process of assessment and management generation. Given a patient clinical profile including medical history, ocular history, family history, presenting symptoms, and multi-modal ophthalmology imaging data (left panel), VOLMO generates diagnostic reports through five sequential reasoning stages (right panel): (1) Differential Diagnoses - listing potential conditions that could explain the patient's clinical presentation, (2) Most Likely Diagnosis - the single condition determined to be the most probable explanation with justification, (3) Assessments and Plans - evaluating the patient's current clinical status including disease severity and stage, (4) Treatment - recommending therapeutic interventions tailored to address the diagnosed condition, and (5) Follow-up Care - providing ongoing management plans including monitoring parameters and anticipated outcomes.
  • Figure 3: Visual interpretation of VOLMO's attention patterns for the query word scar. Top two rows show fundus images with retinal scars (first row) and corresponding attention heatmaps (second row). Bottom two rows display fundus images without scars (third row) and their attention heatmaps (fourth row). Attention intensity is visualized using a color scale ranging from blue (low attention) through green and yellow to red (high attention).
  • Figure 4: Comparison of medical AI model responses on ophthalmological image description tasks.
  • Figure 5: Comparison of medical AI model responses on ophthalmological image description tasks (continued).
  • ...and 7 more figures