RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models

Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang

TL;DR

RetinalGPT tackles the gap between general multimodal language models and retinal image analysis by introducing a retina-focused, instruction-tuned multimodal assistant. It employs a two-stage training pipeline that first achieves feature alignment with an expanded retinal-biomedical vocabulary and then mixes retinal-specific clinical data with broad medical QA to preserve generic knowledge. Empirical results across eight retinal datasets demonstrate improved disease diagnosis, lesion localization, and quantitative vascular analysis, outperforming baselines and providing interpretable outputs. The work also shows that the model retains generic medical-domain knowledge across modalities, underscoring its potential as a practical clinical decision-support and research tool for medical imaging.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to both enhance retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms general-domain MLLMs by a large margin in retinal disease diagnosis on 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT

Paper Structure

This paper contains 10 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Top: the process of obtaining data. Bottom: The collected data is converted into instruction-following data, categorized into Alignment (Bottom Left) and Tuning data (Bottom Right).
  • Figure 2: Overview of the model architecture and training strategy. Top: The network structure. Bottom: the two-stage training process.
  • Figure 3: Top: lesion locations predicted by our model (blue box) compared with ground-truth annotations (red box). Bottom: vascular structure analysis, where the model's estimates of fractal dimension, vessel density, and vessel width are compared against ground-truth values.
  • Figure 4: Comparison of our model and LLaVA-Med (Li et al., 2023) on the generic medical domain. The highlighted sections mark highly similar parts of the responses. Our model demonstrates the ability to preserve knowledge in the generic medical domain.
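
The caption of Figure 3 refers to vessel density and fractal dimension as quantitative vascular measures. The following is a minimal sketch of how such ground-truth values are commonly derived from a binary vessel segmentation mask. It is an illustration only, not the authors' pipeline: the function names, the box-counting estimator, and the choice of box sizes are all assumptions made for this example, and vessel width is omitted.

```python
import numpy as np


def vessel_density(mask):
    """Fraction of pixels marked as vessel in a binary mask (assumed definition)."""
    mask = np.asarray(mask, dtype=bool)
    return float(mask.mean())


def box_count(mask, size):
    """Count size x size boxes that contain at least one vessel pixel."""
    h, w = mask.shape
    # Pad with background so both dimensions are multiples of the box size.
    pad_h = (-h) % size
    pad_w = (-w) % size
    padded = np.pad(mask, ((0, pad_h), (0, pad_w)), constant_values=False)
    rows, cols = padded.shape
    blocks = padded.reshape(rows // size, size, cols // size, size)
    return int(blocks.any(axis=(1, 3)).sum())


def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16, 32)):
    """Estimate fractal dimension as the slope of log N(s) against log(1/s)."""
    mask = np.asarray(mask, dtype=bool)
    counts = np.array([box_count(mask, s) for s in sizes], dtype=float)
    if np.any(counts == 0):
        raise ValueError("mask contains no vessel pixels")
    inv_sizes = 1.0 / np.array(sizes, dtype=float)
    slope, _ = np.polyfit(np.log(inv_sizes), np.log(counts), 1)
    return float(slope)


if __name__ == "__main__":
    # Hypothetical toy mask: a single straight "vessel" across a 64 x 64 image.
    toy = np.zeros((64, 64), dtype=bool)
    toy[10, :] = True
    print("vessel density:", vessel_density(toy))
    print("box-counting dimension:", box_counting_dimension(toy))
```

On real fundus images the mask would come from a vessel segmentation step, and the estimated dimension depends on the range of box sizes used for the fit.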