Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

Yiliang Zhou, Hanley Ong, Patrick Kennedy, Carol Wu, Jacob Kazam, Keith Hentel, Adam Flanders, George Shih, Yifan Peng

TL;DR

This study addresses whether GPT-4V can reliably detect radiologic findings from chest radiographs for clinical use. It uses a retrospective, multi-site dataset of 100 CXRs with radiologist-derived consensus references, evaluating zero-shot and few-shot GPT-4V performance in producing radiologic findings tables linked to ICD-10 codes and laterality. The results show limited accuracy in zero-shot settings with modest gains in few-shot scenarios, and substantial variability across datasets, indicating that GPT-4V is not yet ready for real-world chest radiograph interpretation. The findings highlight the need for task-specific, fine-tuned multimodal models and larger, more diverse datasets to achieve robust, generalizable clinical performance.

Abstract

The study examines the application of GPT-4V, a multi-modal large language model equipped with visual recognition, in detecting radiological findings from a set of 100 chest radiographs and suggests that GPT-4V is currently not ready for real-world diagnostic usage in interpreting chest radiographs.

Paper Structure (15 sections, 1 equation, 5 figures, 2 tables)

This paper contains 15 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Diagram shows the study workflow, including construction of data and application of GPT-4 and GPT-4 with vision (GPT-4V). CXR = chest radiograph, MIDRC = Medical Imaging and Data Resource Center, NIH = National Institutes of Health.
  • Figure 2: An example of GPT-4 with vision (GPT-4V) inputs and output, including the (A) radiology report, (B) chest radiograph, (C) prompt provided to GPT-4V to create a table of radiologic findings derived from the chest radiograph, and (D) resultant table of radiographic findings generated by GPT-4V. AP = anteroposterior.
  • Figure 3: Bar graphs show the performance of GPT-4 with vision (GPT-4V) in the detection of radiologic findings from chest radiographs in the zero-shot setting, with statistical significance assessed using the two-tailed t-test, according to (A) the radiologic findings in International Statistical Classification of Diseases, Tenth Revision (ICD-10) codes only and (B) both the radiologic findings in ICD-10 codes and their corresponding lateralities.
  • Figure 4: Bar graphs show the performance of GPT-4 with vision (GPT-4V) in the detection of radiologic findings from chest radiographs in the few-shot setting, with statistical significance assessed using the two-tailed t-test, according to (A) the radiologic findings in International Statistical Classification of Diseases, Tenth Revision (ICD-10) codes only and (B) both the radiologic findings in ICD-10 codes and their corresponding lateralities.
  • Figure 5: Bar graphs show the difference in performance of GPT-4 with vision (GPT-4V) in the detection of radiologic findings from chest radiographs between the zero-shot and few-shot settings, with statistical significance assessed using the two-tailed t-test, according to (A) the radiologic findings in International Statistical Classification of Diseases, Tenth Revision (ICD-10) codes only and (B) both the radiologic findings in ICD-10 codes and their corresponding lateralities.
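
Figures 3 to 5 each report performance under two matching criteria: (A) ICD-10 codes only and (B) ICD-10 codes together with laterality. The Python sketch below illustrates how the two criteria can diverge for the same prediction. It is a minimal illustration under stated assumptions, not the authors' evaluation code. The `Finding` schema, the `score` function, and the use of set-based precision, recall, and F1 are all assumptions, since this summary does not say which metrics the study used. The example codes (J90, J98.11) are illustrative only.

```python
from typing import Iterable, NamedTuple, Optional


class Finding(NamedTuple):
    """One row of a findings table (hypothetical schema, for illustration)."""
    icd10: str                  # ICD-10 code, e.g. "J90"
    laterality: Optional[str]   # e.g. "left", "right", "bilateral", or None


def score(reference: Iterable[Finding], predicted: Iterable[Finding],
          use_laterality: bool) -> dict:
    """Set-based precision, recall, and F1 for a single radiograph.

    Assumed metrics: the summary does not name the ones used in the study.
    """
    def key(f: Finding):
        # Criterion (B) matches on code and side; criterion (A) on code only.
        return (f.icd10, f.laterality) if use_laterality else f.icd10

    ref = {key(f) for f in reference}
    pred = {key(f) for f in predicted}
    tp = len(ref & pred)  # findings present in both tables
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative data: the prediction has the right codes but one wrong side.
reference = [Finding("J90", "left"), Finding("J98.11", "right")]
predicted = [Finding("J90", "right"), Finding("J98.11", "right")]

codes_only = score(reference, predicted, use_laterality=False)
with_side = score(reference, predicted, use_laterality=True)
print(codes_only["f1"], with_side["f1"])
```

In this toy case the codes-only score is perfect, while the score that also requires the correct side drops by half, which is why the two criteria are worth reporting separately.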