GPT-4V Cannot Generate Radiology Reports Yet

Yuyang Jiang; Chacha Chen; Dang Nguyen; Benjamin M. Mervak; Chenhao Tan

TL;DR

A systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets, MIMIC-CXR and IU X-Ray, finds that it fails terribly in both lexical metrics and clinical efficacy metrics, casting doubt on the viability of using GPT-4V in a radiology workflow.

Abstract

GPT-4V's purported strong multimodal abilities have raised interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (groundtruth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present in the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions in report synthesis, its generated reports are less correct and less natural-sounding than those of a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.

Paper Structure (38 sections, 1 equation, 8 figures, 22 tables)


Introduction
Related Work
Experiment Setup
Method
Dataset and pre-processing
Evaluation metrics
Results
Experiment 1: Can GPT-4V directly generate reports from images?
Our results are consistent across prompting strategies.
Experiment 2: Can GPT-4V interpret chest X-rays meaningfully?
Testing whether GPT-4V generates labels based on given chest X-rays.
Experiment 3: Given groundtruth conditions, can GPT-4V generate reports?
Human Evaluation
Limitations
Conclusions
...and 23 more sections

Figures (8)

Figure 1: Evaluation overview. In Experiment 1, we evaluate the out-of-the-box capability of GPT-4V. We further decompose the task into medical image reasoning (Experiment 2) and report synthesis (Experiment 3).
Figure 2: 95% bootstrap confidence intervals for three example conditions in MIMIC-CXR. When zero falls within the interval, we cannot reject, at the 95% confidence level, the null hypothesis that GPT-4V labels the $j$-th condition independently of which condition group the study belongs to.
Figure 3: 95% Bootstrap confidence interval of top 6 conditions in our sample for GPT-4-vision-preview.
Figure 4: 95% Bootstrap confidence interval of top 5 conditions in our sample for GPT-4o.
Figure 5: Correlations between distributions of Pr(Pos) for different condition groups (GPT-4-vision-preview).
...and 3 more figures
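
The test described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes one plausible reading of the caption: for a given condition, compare the rate at which the model predicts it as positive in two groundtruth condition groups, and build a percentile bootstrap confidence interval for the difference. The function names, the 0/1 encoding, and the percentile method are assumptions made for illustration; the exact statistic used in the paper may differ.

```python
import random


def bootstrap_diff_ci(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b).

    group_a, group_b: lists of 0/1 indicators (1 = the model predicted the
    condition as positive) for studies in two groundtruth condition groups.
    This is a hypothetical helper, not the paper's implementation.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    # Read off the empirical alpha/2 and 1 - alpha/2 quantiles.
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def contains_zero(ci):
    """True when zero lies inside the interval (null not rejected)."""
    lo, hi = ci
    return lo <= 0.0 <= hi


# Illustrative, made-up indicators: the same positive rate in both groups.
same_rate = [1] * 30 + [0] * 70
ci = bootstrap_diff_ci(same_rate, list(same_rate))
print(ci, contains_zero(ci))
```

Under this reading, an interval that contains zero means the predicted-positive rate for the condition does not differ detectably between the two groups, which matches the caption's interpretation that the labels look independent of the condition group.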
