Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs
Yiliang Zhou, Hanley Ong, Patrick Kennedy, Carol Wu, Jacob Kazam, Keith Hentel, Adam Flanders, George Shih, Yifan Peng
TL;DR
This study addresses whether GPT-4V can reliably detect radiological findings on chest radiographs for clinical use. It uses a retrospective, multi-site dataset of 100 chest radiographs (CXRs) with radiologist-derived consensus references, and evaluates GPT-4V in zero-shot and few-shot settings on producing tables of radiological findings linked to ICD-10 codes and laterality. Accuracy is limited in the zero-shot setting, improves only modestly in the few-shot setting, and varies substantially across datasets, indicating that GPT-4V is not yet ready for real-world chest radiograph interpretation. The findings highlight the need for task-specific, fine-tuned multimodal models and larger, more diverse datasets to achieve robust, generalizable clinical performance.
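To make the comparison against the consensus reference concrete, here is a minimal sketch of how one image's predicted findings table could be scored. It assumes each finding is reduced to an (ICD-10 code, laterality) pair and that exact-match precision, recall, and F1 are the measures of interest; this summary does not state the study's actual metrics or matching rules, so the function name `score_findings`, the normalization, and the example codes are all illustrative assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: a finding is an (ICD-10 code, laterality) pair.
Finding = Tuple[str, str]


def _normalize(findings: Iterable[Finding]) -> Set[Finding]:
    """Upper-case codes and lower-case laterality so trivial formatting
    differences do not count as mismatches."""
    return {(code.strip().upper(), side.strip().lower()) for code, side in findings}


def score_findings(predicted: Iterable[Finding],
                   reference: Iterable[Finding]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 for one image (hypothetical helper)."""
    pred = _normalize(predicted)
    ref = _normalize(reference)
    tp = len(pred & ref)  # findings present in both prediction and reference
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    model_output = [("J90", "left"), ("J18.9", "right")]
    consensus = [("J90", "left"), ("I51.7", "bilateral")]
    print(score_findings(model_output, consensus))
```

Scoring on the pair rather than the code alone means a correct finding reported on the wrong side counts as an error, which is a stricter choice than code-only matching.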
Abstract
The study examines the application of GPT-4V, a multimodal large language model with visual recognition capabilities, to detecting radiological findings in a set of 100 chest radiographs. The results suggest that GPT-4V is not currently ready for real-world diagnostic use in interpreting chest radiographs.
