Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

Yutong Zhang, Yi Pan, Tianyang Zhong, Peixin Dong, Kangni Xie, Yuxiao Liu, Hanqi Jiang, Zhengliang Liu, Shijie Zhao, Tuo Zhang, Xi Jiang, Dinggang Shen, Tianming Liu, Xin Zhang

TL;DR

This work assesses multimodal large language models (MLLMs) for data mining of medical images and radiology free-text reports by comparing Gemini-series and GPT-series models across 14 datasets and six tasks: disease classification, lesion segmentation, anatomical localization, disease diagnosis, report generation, and lesion detection. It analyzes two major MLLM families (multimodality-alignment and multimodality-generation) alongside several baselines (Yi, Claude, Llama 3) to gauge zero-shot performance, generation efficiency, and cross-modal reasoning. Gemini-series models excel in report generation and lesion detection but falter in disease classification and localization, while GPT-series models are strong in lesion segmentation and localization yet struggle with certain diagnostic tasks. Both families offer generation-time advantages that could reduce physician workload, but they require further validation and attention to safety and regulatory considerations before clinical deployment. The study provides a benchmark-driven, cross-domain evaluation framework and highlights practical pathways for extending MLLMs to broader medical specialties, while underscoring the need for comprehensive validation and governance in healthcare AI deployment.

Abstract

Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence (AGI) for computer vision, showcasing their potential in the biomedical domain. In this study, we evaluated the performance of Gemini, GPT-4, and 4 popular large models across 14 medical imaging datasets, covering 5 medical imaging categories (dermatology, radiology, dentistry, ophthalmology, and endoscopy), and 3 radiology report datasets. The investigated tasks encompass disease classification, lesion segmentation, anatomical localization, disease diagnosis, report generation, and lesion detection. Our experimental results demonstrated that Gemini-series models excelled in report generation and lesion detection but faced challenges in disease classification and anatomical localization. Conversely, GPT-series models exhibited proficiency in lesion segmentation and anatomical localization but encountered difficulties in disease diagnosis and lesion detection. Additionally, both the Gemini series and the GPT series contain models that demonstrated commendable generation efficiency. While both model series hold promise in reducing physician workload, alleviating pressure on limited healthcare resources, and fostering collaboration between clinical practitioners and artificial intelligence technologies, substantial enhancements and comprehensive validation remain imperative before clinical deployment.

Paper Structure (45 sections, 2 equations, 26 figures, 1 table)

This paper contains 45 sections, 2 equations, 26 figures, 1 table.

Figures (26)

  • Figure 1: Schematic Overview of the Evaluation Tasks and Methods. (a) The core of the graph delineates the two focal tasks of our evaluation model, while the specific datasets or evaluation subjects pertinent to each task are outlined along its periphery. (b) Our testing tasks were executed by formulating suitable prompts and leveraging the online language model services offered by OpenAI and Google.
  • Figure 2: Chest: Case 1. In the three-class classification task for pneumonia (normal, pneumonia, COVID-19), the green annotation indicates the correctly identified segments, while the red annotation denotes the incorrectly identified segments.
  • Figure 3: Ophthalmological Imaging: Case 1. For glaucoma diagnosis and macular fovea localization, the green annotation indicates the correctly identified segments, while the red annotation denotes the incorrectly identified segments.
  • Figure 4: Endoscopic: Case 1. The task of colon polyp localization and segmentation uses green markings to represent correct answers, red markings to represent incorrect answers, and yellow markings to highlight noteworthy content.
  • Figure 5: Skin: Case 1. The nine classification tasks for skin diseases utilize green markings to denote correct answers and red markings to indicate incorrect answers.
  • ...and 21 more figures