Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification

Ziyu Yang, Santhosh Cherian, Slobodan Vucetic

TL;DR

This study explores the suitability of large language models in automatically generating simplifications in radiology reports, and examines the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain.

Abstract

Radiology reports are highly technical documents aimed primarily at doctor-to-doctor communication. There has been increasing interest in sharing those reports with patients, which necessitates providing them with patient-friendly simplifications of the original reports. This study explores the suitability of large language models in automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.

Paper Structure

This paper contains 34 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Layperson evaluation of radiology report simplifications. (a) (left panel) evaluates whether laypeople understand the original sentence. (b) (middle panel) evaluates whether simplification improves understanding. (c) (right panel) evaluates the preferences given a set of candidate simplifications and asks for justification.
  • Figure 2: The workflow of self-correction mechanism. Processor agent decides when to stop the process.
  • Figure 3: Distribution of confidence level (Q2) by laypeople given the original sentence and four types of simplifications.
  • Figure 4: The horizontal stacked histogram of laypeople vote distribution for the most and least preferred simplifications.
  • Figure 5: Expert evaluation of radiology report simplification. (left panel) lists instructions, (right panel) is a survey form with text boxes for ratings and justification.
  • ...and 2 more figures