LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

Zhenyue Qin; Yu Yin; Dylan Campbell; Xuansheng Wu; Ke Zou; Yih-Chung Tham; Ninghao Liu; Xiuzhen Zhang; Qingyu Chen

TL;DR

LMOD is a reproducible dataset and benchmark for evaluating large vision-language models (LVLMs) in ophthalmology. It integrates five imaging modalities with free-text and demographic information to support anatomical recognition and disease diagnosis. Across 13 state-of-the-art LVLMs, results reveal substantial gaps in ophthalmic image understanding and diagnostic reasoning: accuracy remains far from clinical utility, and simple supervised baselines outperform the LVLMs. A detailed error analysis identifies six failure modes and shows demographic influences on model performance, and fine-tuning LVLMs on ophthalmic data yields limited gains. These findings underscore the need for ophthalmology-specific LVLMs, robust uncertainty handling, and benchmark-driven validation before AI-assisted ophthalmic decision support can be deployed safely.

Abstract

The prevalence of vision-threatening eye diseases is a significant global burden, with many cases remaining undiagnosed or diagnosed too late for effective treatment. Large vision-language models (LVLMs) have the potential to assist in understanding anatomical information, diagnosing eye diseases, and drafting interpretations and follow-up plans, thereby reducing the burden on clinicians and improving access to eye care. However, limited benchmarks are available to assess LVLMs' performance in ophthalmology-specific applications. In this study, we introduce LMOD, a large-scale multimodal ophthalmology benchmark consisting of 21,993 instances across (1) five ophthalmic imaging modalities: optical coherence tomography, color fundus photographs, scanning laser ophthalmoscopy, lens photographs, and surgical scenes; (2) free-text, demographic, and disease biomarker information; and (3) primary ophthalmology-specific applications such as anatomical information understanding, disease diagnosis, and subgroup analysis. In addition, we benchmarked 13 state-of-the-art LVLM representatives from closed-source, open-source, and medical domains. The results demonstrate a significant performance drop for LVLMs in ophthalmology compared to other domains. Systematic error analysis further identified six major failure modes: misclassification, failure to abstain, inconsistent reasoning, hallucination, assertions without justification, and lack of domain-specific knowledge. In contrast, supervised neural networks specifically trained on these tasks as baselines demonstrated high accuracy. These findings underscore the pressing need for benchmarks in the development and validation of ophthalmology-specific LVLMs.

Paper Structure

This paper contains 33 sections, 1 equation, 13 figures, 8 tables, 1 algorithm.

Introduction
Related Work
Advances in LVLMs
Lack of Benchmarks
LMOD Curation
Data Curation Pipeline
Benchmarking Results
Benchmarked LVLMs
Evaluation Metrics
Anatomical Recognition
Diagnosis Analysis
Benchmark Justifications
Limitations
Conclusion
Prompts for Benchmarking
...and 18 more sections

Figures (13)

Figure 1: LVLM response examples for macular hole staging.
Figure 2: Overview of our data processing and evaluation pipeline for assessing the performance of LVLMs. The raw information is preprocessed to extract structured data such as bounding boxes and disease conditions. This aggregated information is then used to generate prompts that ask the LVLMs to identify the type of each labeled region or to conduct diagnosis analysis. The LVLMs process the input image and prompt to generate responses that categorize each region or disease, or describe diseases. Finally, the model's output is compared against the ground truth using our proposed evaluation metrics.
Figure 3: Performance comparison of top-performing LVLMs across different ophthalmic imaging modalities. The radar charts display, for each evaluation metric (Precision, Recall, F1, and HR), the performance of the models with the highest F1 across five imaging modalities: surgical scenes (SS), optical coherence tomography (OCT), color fundus photographs (CFP), scanning laser ophthalmoscopy (SLO), and lens photographs (LP).
Figure 4: Visual examples of LVLM predictions for anatomical recognition in OCT images. The figure presents a comparison of ground truth (GT) annotations and predictions from three representative LVLMs: GPT-4o, LLaVA-M-7B, and VILA-3B. Green ticks indicate correct predictions, while red crosses mark incorrect ones. VILA-3B generates an invalid response consisting of a sequence of numbers unrelated to the task.
Figure 5: Robustness analysis of LVLMs across different color fundus photograph datasets. The bar chart displays the F1 scores of the five models (GPT-4o, LLaVA-Med, LLaVA-M-7B, InternVL-2B, and InternVL-4B) on four different color fundus photograph datasets: IDRID, ORIGA, REFUGE, and G1020.
...and 8 more figures
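
The Figure 2 caption describes comparing model outputs against ground truth, and Figure 3 names Precision, Recall, and F1 among the metrics. The following is a minimal sketch of how such a comparison could be scored for region-label predictions. It is not the paper's evaluation code: the function name `macro_prf`, the macro-averaging over ground-truth labels, and the example label strings are all assumptions made for illustration, and HR is left out because this page does not define it.

```python
from collections import Counter


def macro_prf(ground_truth, predictions):
    """Macro-averaged precision, recall, and F1 over the ground-truth label set.

    Hypothetical helper for illustration; the averaging scheme is an assumption.
    """
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gt, pred in zip(ground_truth, predictions):
        if pred == gt:
            tp[gt] += 1
        else:
            fn[gt] += 1    # the true label was missed
            fp[pred] += 1  # counted for the predicted string; unused below if
                           # that string never occurs as a ground-truth label

    labels = sorted(set(ground_truth))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Illustrative labels only; the last prediction mimics an invalid response.
gt = ["optic disc", "optic cup", "optic disc", "optic cup"]
pred = ["optic disc", "optic disc", "optic disc", "1 2 3"]
precision, recall, f1 = macro_prf(gt, pred)
```

In this sketch an invalid response such as a sequence of numbers simply counts as a miss for the true label; a benchmark may well treat such outputs separately.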
