Clinical Context-aware Radiology Report Generation from Medical Images using Transformers

Sonit Singh

TL;DR

This work investigates radiology report generation from chest X-rays using a CNN encoder and a Transformer decoder, formulated as maximizing the autoregressive likelihood $\log p(S|I) = \sum_t \log p(S_t|I, S_{<t};\theta)$. It compares a CNN+Transformer pipeline against a CNN+LSTM baseline on the IU-CXR dataset, showing faster training and competitive or improved natural-language metrics. The study also argues that standard NLG metrics alone are insufficient for clinical usefulness and proposes a clinical context-aware evaluation using CheXpert-derived observations plus metrics like KA, DCS, and Clinical Coherence. Overall, the results support Transformer decoders for radiology report generation and highlight the need for larger-scale data and multi-faceted evaluation before such systems can be translated into clinical practice.
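
The autoregressive objective above can be illustrated with a minimal sketch. It assumes the decoder has already produced, for each position $t$, the probability it assigns to the correct report token given the image and the preceding tokens. The function name `sequence_log_likelihood` and the example probabilities are hypothetical and are not taken from the paper.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of per-token log-probabilities, i.e. log p(S | I).

    token_probs[t] is assumed to be the probability the decoder assigns to
    the correct token S_t given the image and the tokens before it.
    """
    if not token_probs:
        raise ValueError("token_probs must contain at least one probability")
    total = 0.0
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must lie in (0, 1]")
        total += math.log(p)
    return total


# Hypothetical probabilities for a three-token report (illustration only).
example_probs = [0.5, 0.25, 0.5]
log_lik = sequence_log_likelihood(example_probs)
print(log_lik)  # equals log(0.5 * 0.25 * 0.5)
```

Training maximizes this quantity over the dataset, which is equivalent to minimizing the summed token-level cross-entropy.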

Abstract

Recent developments in Natural Language Processing, especially language models such as the transformer, have brought state-of-the-art results in language understanding and language generation. In this work, we investigate the use of the transformer model for radiology report generation from chest X-rays. We also highlight limitations in evaluating radiology report generation using only the standard language generation metrics. We then apply a transformer-based radiology report generation architecture and compare the performance of a transformer-based decoder with that of a recurrence-based decoder. Experiments on the IU-CXR dataset show that the transformer decoder gives superior results to its LSTM counterpart while being significantly faster. Finally, we identify the need to evaluate radiology report generation systems using both language generation metrics and classification metrics, which helps provide a robust measure of generated reports in terms of their coherence and diagnostic value.

Paper Structure (20 sections, 7 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Block diagram of the (A) CNN + LSTM and (B) CNN + Transformer models. In the proposed model (B), the CNN encoder extracts chest X-ray embeddings and the Transformer decoder generates the corresponding radiology report.
  • Figure 2: The encoder-decoder framework of the proposed CNN+Transformer model. The encoder is a convolutional neural network, such as ResNet. The decoder is a Transformer, which can have $N$ identical decoder layers. $<start>$ and $<end>$ tokens are added at the beginning and end of the radiology report text, respectively.
  • Figure 3: A selected sample case to highlight issues in evaluating radiology report with natural language generation (NLG) metrics.
  • Figure 4: Detailed architecture of clinical context-aware radiology report generation.
  • Figure 5: A walk-through example of robust evaluation for radiology report generation. CNN: Convolutional Neural Network; MLC: Multi-label Classifier; P: Precision; R: Recall; F1: F1-score.
  • ...and 1 more figures
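
The precision, recall and F1-score named in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming the clinical comparison is made between the set of positive observations extracted from the reference report and the set extracted from the generated report. The helper `precision_recall_f1` and the two example sets are hypothetical. The observation names are CheXpert categories, but the labelling step itself is not shown and the paper's exact protocol may differ.

```python
def precision_recall_f1(reference, generated):
    """Set-based precision, recall and F1 over clinical observations."""
    reference, generated = set(reference), set(generated)
    true_positives = len(reference & generated)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical observation sets (illustration only, not results from the paper).
reference_obs = {"Cardiomegaly", "Pleural Effusion"}
generated_obs = {"Cardiomegaly", "Pneumothorax"}

p, r, f1 = precision_recall_f1(reference_obs, generated_obs)
print(p, r, f1)
```

A generated report can score well on word-overlap metrics while missing or inventing an observation. A label-level comparison of this kind is intended to expose that gap.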