A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images

Xiaoyi Liang, Mouxiao Bian, Moxin Chen, Lihao Liu, Junjun He, Jie Xu, Lin Li

TL;DR

This work benchmarks multimodal large language models (MLLMs) on ophthalmic data by assembling a rigorously annotated dataset of 439 fundus images and 75 OCT images and evaluating seven API-accessible MLLMs on fundus and OCT tasks. Using a MedBench-inspired, API-based framework, the study finds substantial variability in diagnostic accuracy across diseases, with consistent weaknesses on choroidal neovascularization (CNV) and myopia (MYA) and uneven performance across the two modalities. Open-source status and parameter count influence performance but are not the sole determinants. The results underscore the need for clinically grounded benchmarks with diverse data, expert validation, and richer multimodal context, and they point to targeted model refinement and multimodal integration as the way to advance ophthalmic diagnosis and clinical decision support.

Abstract

In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder accurate assessment of MLLMs' ability to interpret OCT scans and of their broader applicability in ophthalmology. To address these gaps, we present a benchmark dataset, curated through rigorous quality control and expert annotation, comprising 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.
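The per-disease comparison described in the abstract can be illustrated with a minimal sketch. It assumes each model response has already been reduced to a single predicted disease label; the function name `per_disease_accuracy`, the record layout, and the toy labels are hypothetical and are not taken from the paper's framework or data.

```python
from collections import defaultdict


def per_disease_accuracy(records):
    """Compute accuracy separately for each ground-truth disease.

    records: iterable of (true_label, predicted_label) pairs.
    This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_label, predicted_label in records:
        total[true_label] += 1
        if predicted_label == true_label:
            correct[true_label] += 1
    # Fraction of correctly diagnosed images per disease.
    return {disease: correct[disease] / total[disease] for disease in total}


# Hypothetical toy records, not results from the paper.
toy_records = [
    ("DR", "DR"),
    ("DR", "DR"),
    ("AMD", "AMD"),
    ("CNV", "AMD"),
    ("CNV", "CNV"),
    ("MYA", "DR"),
]
toy_accuracy = per_disease_accuracy(toy_records)
for disease, accuracy in toy_accuracy.items():
    print(f"{disease}: {accuracy:.2f}")
```

Grouping by the ground-truth label rather than pooling all images keeps a weakness on a rare class from being hidden by strong results on a common one.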

Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, 1 table.

Figures (7)

  • Figure 1: Ophthalmic Benchmark
  • Figure 2: Normalized Accuracy of MLLMs in FPD
  • Figure 3: Accuracy of MLLMs in FPD by diseases
  • Figure 4: Normalized Accuracy of MLLMs in OCTD
  • Figure 5: Accuracy of MLLMs in OCTD by diseases
  • ...and 2 more figures