
Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis

Dimitrios P. Panagoulias, Maria Virvou, George A. Tsihrintzis

TL;DR

This work addresses the challenge of evaluating LLM-driven medical diagnoses from multimodal inputs by introducing a two-step paradigm: (1) multimodal LLM evaluation using structured image-plus-text MCQs and (2) domain-specific analysis leveraging Image Metadata Analysis, Named Entity Recognition, and Knowledge Graphs to identify actionable fine-tuning paths. Applying this framework to pathology MCQs, the authors demonstrate that GPT-4-Vision-Preview attains about 84% diagnostic accuracy, and they provide a granular analysis of errors across image domains, entities, and knowledge graphs. The approach yields detailed insights into model weaknesses, particularly in cardiovascular and endocrine knowledge paths, and offers a generalizable methodology and open-source tooling for evaluating and optimizing other multimodal LLMs in clinical settings. Overall, the study contributes a rigorous, reproducible framework for validating multimodal clinical AI and guiding targeted improvements with practical impact for safe medical deployment.

Abstract

Large language models (LLMs) constitute a breakthrough, state-of-the-art Artificial Intelligence technology that is rapidly evolving and promises to aid in medical diagnosis. However, the correctness and accuracy of their output have not yet been properly evaluated. In this work, we propose an LLM evaluation paradigm that comprises two independent steps of a novel methodology, namely (1) multimodal LLM evaluation via structured interactions and (2) follow-up, domain-specific analysis based on data extracted via the previous interactions. Using this paradigm, we (1) evaluate the correctness and accuracy of LLM-generated medical diagnoses with publicly available multimodal multiple-choice questions (MCQs) in the domain of Pathology and (2) proceed to a systematic and comprehensive analysis of the extracted results. We used GPT-4-Vision-Preview as the LLM to respond to complex medical questions consisting of both images and text, and we explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology. GPT-4-Vision-Preview performed quite well, giving the correct diagnosis in approximately 84% of cases. Next, we further analyzed our findings, following an analytical approach that included Image Metadata Analysis, Named Entity Recognition, and Knowledge Graphs. Weaknesses of GPT-4-Vision-Preview were revealed on specific knowledge paths, leading to a further understanding of its shortcomings in specific areas. Our methodology and findings are not limited to GPT-4-Vision-Preview; a similar approach can be followed to evaluate the usefulness and accuracy of other LLMs and, thus, to improve their use through further optimization.
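To make the first step concrete, the following is a minimal Python sketch of how MCQ responses could be scored overall and per domain once the model's chosen options have been collected. It is not the authors' implementation: the `McqResult` record, its field names, the `accuracy_by_domain` helper, and the toy rows are all hypothetical and serve only to illustrate the bookkeeping behind a figure such as "approximately 84% correct" and a per-domain breakdown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class McqResult:
    """One answered question. All field names are hypothetical."""
    domain: str          # sub-domain label attached to the question
    correct_option: str  # answer key, e.g. "A"
    model_option: str    # option letter returned by the LLM


def accuracy_by_domain(results):
    """Return (overall_accuracy, {domain: accuracy}) for a list of results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.domain] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r.model_option.strip().upper() == r.correct_option.strip().upper():
            hits[r.domain] += 1
    if not totals:
        return 0.0, {}
    overall = sum(hits.values()) / sum(totals.values())
    per_domain = {d: hits[d] / totals[d] for d in totals}
    return overall, per_domain


# Toy data for illustration only; these are NOT results from the paper.
toy = [
    McqResult("cardiovascular", "A", "B"),
    McqResult("cardiovascular", "C", "c"),
    McqResult("endocrine", "D", "D"),
    McqResult("renal", "B", "B"),
]
overall, per_domain = accuracy_by_domain(toy)
print(f"overall accuracy: {overall:.2f}")
for domain, acc in sorted(per_domain.items()):
    print(f"{domain}: {acc:.2f}")
```

Domains with low per-domain accuracy would be natural candidates for the follow-up, domain-specific analysis described in the second step.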

Paper Structure (19 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Multimodal LLM evaluation
  • Figure 2: Domain specific Analysis
  • Figure 5: Total Scores per Domain
  • Figure 6: Incorrect responses based on images used
  • Figure 7: Correct responses based on images used
  • ...and 6 more figures