LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology

Zhenyue Qin, Yang Liu, Yu Yin, Jinyu Ding, Haoran Zhang, Anran Li, Dylan Campbell, Xuansheng Wu, Ke Zou, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen

TL;DR

LMOD+ delivers a comprehensive multimodal ophthalmology benchmark (32,633 instances across 12 conditions and 5 imaging modalities) to develop and evaluate generative, multimodal LLMs. By unifying data curation, providing free-text prompts, and evaluating 24 MLLMs in tasks spanning anatomical recognition, disease screening, staging, and demographic prediction, the work reveals a substantial gap between current general-domain MLLMs and ophthalmology needs, with zero-shot disease screening around 58% accuracy and disease staging remaining difficult. The substantial dataset expansion, broader task coverage, and public leaderboard offer a resource to drive domain-specific model development and reduce vision-threatening disease burden through AI. The study also highlights strong performance for some models in specific sub-tasks (e.g., anatomical recognition by InternVL variants) but overall indicates that clinical-grade ophthalmic AI will require targeted modeling and data strategies beyond zero-shot transfer.

Abstract

Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.

Paper Structure

This paper contains 13 sections, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of dataset construction and representative data samples: (a) data curation pipeline; (b) examples from multiple ophthalmic imaging modalities with corresponding task settings.
  • Figure 2: Performance comparison of top-performing MLLMs across different ophthalmic imaging modalities. The radar charts display the performance of the top-F1-performing models, for each evaluation metric (Precision, Recall, F1, and HR) across five different imaging modalities: surgical scenes (SS), optical coherence tomography (OCT), color fundus photographs (CFP), scanning laser ophthalmoscopy (SLO), and lens photographs (LP).
  • Figure 3: Binary Eye Condition Diagnosis Accuracy Heatmap. Performance comparison of 23 MLLMs across 12 eye conditions. Color scale represents classification accuracy (0-1), with darker colors indicating superior diagnostic performance.
  • Figure 4: Performance comparison of MLLMs on multi-class eye disease diagnosis task. The scatter plot shows the relationship between model size (billions of parameters) and diagnostic accuracy on a four-class eye disease classification task using CFP images. Each point represents a different model. Connected lines within each model family show the performance progression across different parameter scales. The gray dashed line indicates random chance performance (25% for four-class classification). Selected LLaVA variants are labeled to distinguish between different architectural configurations.
  • Figure 5: Comparative Performance of MLLMs on Ophthalmologic Stage Diagnosis Tasks. Bar chart comparing accuracy of 10 selected MLLMs across three distinct ophthalmologic datasets requiring stage-based diagnosis: OIMHS Macular Hole (MH) Stage classification, ICDR severity grading, and SDRG. The horizontal dashed lines at 20% and 25% represent baseline performance thresholds. Models evaluated include InternVL variants (1.5-2B to 2.5-8B-MPO), LLaVA family models, LLaVA-Med-7B, QWen-7B, YI-VL-6B, and DeepSeek VL2-Tiny. ICDR demonstrates the highest achievable accuracies (up to 40%), while OIMHS MH Stage and SDRG show more consistent performance in the 15%-25% range. InternVL 2.5-8B exhibits superior performance on ICDR compared to other models.
  • ...and 2 more figures
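
The figure captions above refer to Precision, Recall, F1, accuracy, and a random-chance line (25% for the four-class task in Figure 4). As a rough illustration only, the sketch below shows how such quantities are commonly computed from predicted and reference labels. The function names and the toy labels are hypothetical and not taken from the paper. Reading the 20% and 25% lines in Figure 5 as uniform-guessing baselines for five-class and four-class tasks is an assumption; the caption only calls them baseline performance thresholds.

```python
from typing import Hashable, Sequence, Tuple


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over num_classes labels."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


def accuracy(predictions: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def precision_recall_f1(
    predictions: Sequence[Hashable],
    labels: Sequence[Hashable],
    positive: Hashable,
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("inputs must be of equal length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels, purely illustrative; not data from the benchmark.
    preds = ["dr", "normal", "dr", "normal"]
    truth = ["dr", "dr", "normal", "normal"]
    print("accuracy:", accuracy(preds, truth))
    print("precision/recall/F1 for 'dr':", precision_recall_f1(preds, truth, "dr"))
    print("four-class chance level:", chance_accuracy(4))
```

Comparing a model's accuracy against `chance_accuracy(k)` for a k-class task gives a quick sanity check on whether a reported score is meaningfully above guessing.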