FunBench: Benchmarking Fundus Reading Skills of MLLMs

Qijie Wei, Kaiheng Qian, Xirong Li

TL;DR

FunBench introduces a hierarchical, task-diverse VQA benchmark to rigorously assess fundus-reading capabilities of Multimodal Large Language Models. By decomposing tasks into four levels (modality, anatomy, lesion, and disease) and offering three evaluation modes—VE linear probing, knowledge-guided LLM evaluation, and holistic end-to-end testing—it enables modular and holistic analysis of VE and LLM contributions. Evaluations across nine open-source MLLMs plus GPT-4o show pervasive gaps, especially in basic tasks like laterality, with results indicating heavy reliance on LLMs and relatively weaker vision encoders; domain-specific ophthalmic training is identified as a key determinant of performance. The findings highlight the need for ophthalmology-focused LLMs and improved VEs to achieve reliable fundus analysis in clinical settings, guiding future benchmark design and model training. FunBench provides a scalable, reproducible framework for dissecting fundus-reading capabilities in MLLMs and advancing domain-specific multimodal medical AI.

Abstract

Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of an MLLM's two key modules, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs' fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on nine open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.
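
To make the linear-probe based VE evaluation mode concrete, the following is a minimal sketch of the general technique, not the authors' implementation. The idea is to keep the vision encoder frozen, extract one feature vector per image, and train only a linear classifier on top, so that accuracy reflects what the encoder's features already encode. Everything here is an assumption for illustration: the synthetic vectors produced by the hypothetical `make_fake_features` stand in for frozen VE embeddings, the two classes could be read as left eye versus right eye, and the feature dimension, learning rate, and epoch count are arbitrary.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression head on frozen features with gradient descent.

    Only `weights` and `bias` are learned; the features are never updated,
    which is what makes this a linear probe.
    """
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def probe_accuracy(features, labels, weights, bias):
    """Fraction of samples whose highest-scoring class matches the label."""
    preds = (features @ weights + bias).argmax(axis=1)
    return float((preds == labels).mean())


def make_fake_features(rng, n_per_class, centers, noise=0.5):
    """Hypothetical stand-in for frozen vision-encoder embeddings.

    Draws Gaussian clusters around one center per class.
    """
    feats = np.concatenate(
        [c + noise * rng.standard_normal((n_per_class, c.shape[0])) for c in centers]
    )
    labels = np.repeat(np.arange(len(centers)), n_per_class)
    return feats, labels


rng = np.random.default_rng(0)
# Two well-separated class centers in a 16-dimensional feature space (assumed).
centers = np.array([np.full(16, 1.0), np.full(16, -1.0)])
train_x, train_y = make_fake_features(rng, 100, centers)
test_x, test_y = make_fake_features(rng, 50, centers)

w, b = train_linear_probe(train_x, train_y, num_classes=2)
acc = probe_accuracy(test_x, test_y, w, b)
print(f"linear-probe accuracy on held-out features: {acc:.3f}")
```

In a real setting, `train_x` and `test_x` would be replaced by embeddings computed once from the frozen vision encoder of each MLLM, which allows encoders to be compared on the same task without involving the LLM.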

Paper Structure

This paper contains 12 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Proposed FunBench for assessing an MLLM's fundus reading skills by (a) varied-level tasks and three distinct evaluation modes, i.e., (b) E-mode I: linear-probe based vision encoder (VE) evaluation, (c) E-mode II: knowledge-prompted large language model (LLM) evaluation, and (d) E-mode III: holistic evaluation.