CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling

Jawook Gu, Kihyun You, Han-Cheol Cho, Jiho Kim, Eun Kyoung Hong, Byungseok Roh

TL;DR

CheX-GPT is a BERT-based labeler, trained on GPT-labeled data, that runs faster and more efficiently than its GPT counterpart while exceeding existing labelers in accuracy, flexibility, and scalability.

Abstract

Free-text radiology reports present a rich data source for various medical tasks, but effectively labeling these texts remains challenging. Traditional rule-based labeling methods fall short of capturing the nuances of diverse free-text patterns. Moreover, models using expert-annotated data are limited by data scarcity and pre-defined classes, impacting their performance, flexibility and scalability. To address these issues, our study offers three main contributions: 1) We demonstrate the potential of GPT as an adept labeler using carefully designed prompts. 2) Utilizing only the data labeled by GPT, we trained a BERT-based labeler, CheX-GPT, which operates faster and more efficiently than its GPT counterpart. 3) To benchmark labeler performance, we introduced a publicly available expert-annotated test set, MIMIC-500, comprising 500 cases from the MIMIC validation set. Our findings demonstrate that CheX-GPT not only excels in labeling accuracy over existing models, but also showcases superior efficiency, flexibility, and scalability, supported by our introduction of the MIMIC-500 dataset for robust benchmarking. Code and models are available at https://github.com/Soombit-ai/CheXGPT.

Paper Structure (27 sections, 3 figures, 11 tables)

Figures (3)

  • Figure 1: (a) Labeling process for the rule-based labeler. (b) Training framework for the fine-tuned labeler, using a small volume of manually annotated data. (c) Training framework of the proposed CheX-GPT, employing a large volume of pseudo-labeled data generated by an LLM.
  • Figure 2: The overall framework consists mainly of the GPT labeler and CheX-GPT. First, a CXR report is placed at the end of the CXR labeling prompt. GPT-4 then extracts positive findings from the report. These findings are mapped to 13 distinct categories, which then provide supervision for the BERT-based CheX-GPT model.
  • Figure 3: Macro-averaged F1 score curve for various training data sizes (log scale). Performance tends to increase with training data size. Performance on Impressions saturates with fewer training samples than on Findings.
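
The pipeline in the Figure 2 caption and the metric in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt text, the three-category list (the paper uses 13, which are not listed in this section), the phrase-to-category mapping, and all function names are hypothetical placeholders. The GPT-4 call itself is omitted, so the sketch only shows how a report might be placed at the end of a prompt, how extracted findings could be mapped to a multi-hot label vector, and how a macro-averaged F1 score is computed over such vectors.

```python
from typing import Dict, List, Sequence

# Hypothetical subset of categories for illustration only; the paper uses 13.
CATEGORIES: List[str] = ["atelectasis", "consolidation", "pleural effusion"]

# Hypothetical mapping from extracted finding phrases to categories.
PHRASE_TO_CATEGORY: Dict[str, str] = {
    "atelectasis": "atelectasis",
    "collapse": "atelectasis",
    "consolidation": "consolidation",
    "pleural effusion": "pleural effusion",
    "effusion": "pleural effusion",
}

# Hypothetical prompt wording; the report goes at the end, as in Figure 2.
PROMPT_TEMPLATE = (
    "List the positive findings in the following chest X-ray report.\n\n"
    "Report:\n{report}"
)


def build_prompt(report: str) -> str:
    """Place the report text at the end of the labeling prompt."""
    return PROMPT_TEMPLATE.format(report=report.strip())


def findings_to_multihot(findings: Sequence[str]) -> List[int]:
    """Map extracted finding phrases to a multi-hot vector over CATEGORIES."""
    labels = [0] * len(CATEGORIES)
    for finding in findings:
        category = PHRASE_TO_CATEGORY.get(finding.strip().lower())
        if category is not None:  # phrases without a mapping are ignored
            labels[CATEGORIES.index(category)] = 1
    return labels


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Compute per-category F1 and average with equal weight per category.

    Assumption: a category with no true or predicted positives scores 0.0.
    """
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes
```

In this sketch, the multi-hot vectors returned by `findings_to_multihot` would play the role of the pseudo-labels that supervise the BERT-based model, and `macro_f1` would compare a labeler's predictions against expert annotations such as those in MIMIC-500.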