On the Automatic Generation of Medical Imaging Reports

Baoyu Jing, Pengtao Xie, Eric Xing

TL;DR

This work tackles automatic generation of medical imaging reports by proposing a multi-task framework that jointly predicts diagnostic tags and generates long, coherent paragraphs. A co-attention mechanism fuses visual image features with predicted tag embeddings to localize abnormalities and produce descriptive narration. A hierarchical LSTM decoder first generates sentence topics and then composes each sentence, enabling high-quality long-form reports. Evaluations on IU X-Ray and PEIR Gross show substantial gains over baselines, with qualitative analyses illustrating improved abnormality localization and narrative fidelity, highlighting potential to reduce clinician workload while maintaining accuracy.

Abstract

Medical imaging is widely used in clinical practice for diagnosis and treatment. Report-writing can be error-prone for inexperienced physicians, and time-consuming and tedious for experienced physicians. To address these issues, we study the automatic generation of medical imaging reports. This task presents several challenges. First, a complete report contains multiple heterogeneous forms of information, including findings and tags. Second, abnormal regions in medical images are difficult to identify. Third, the reports are typically long, containing multiple sentences. To cope with these challenges, we (1) build a multi-task learning framework which jointly performs the prediction of tags and the generation of paragraphs, (2) propose a co-attention mechanism to localize regions containing abnormalities and generate narrations for them, and (3) develop a hierarchical LSTM model to generate long paragraphs. We demonstrate the effectiveness of the proposed methods on two publicly available datasets.
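To make the co-attention idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes an additive (tanh-based) scoring function, randomly initialized weights, and toy dimensions; the names `attend`, `co_attention`, `rand_matrix`, and the `params` layout are hypothetical. It attends separately over visual region features and over tag embeddings, both conditioned on a decoder hidden state, and then projects the concatenated contexts into one joint context vector.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(features, hidden, W_f, W_h, w_score):
    """Additive attention (an assumed form): score_i = w_score . tanh(W_f f_i + W_h h)."""
    h_proj = matvec(W_h, hidden)
    scores = []
    for f in features:
        f_proj = matvec(W_f, f)
        act = [math.tanh(a + b) for a, b in zip(f_proj, h_proj)]
        scores.append(sum(w * a for w, a in zip(w_score, act)))
    weights = softmax(scores)
    dim = len(features[0])
    # Context is the attention-weighted sum of the input features.
    context = [sum(wt * f[d] for wt, f in zip(weights, features)) for d in range(dim)]
    return weights, context


def co_attention(visual, semantic, hidden, params):
    """Attend over both modalities, then fuse the two contexts linearly."""
    v_w, v_ctx = attend(visual, hidden, *params["visual"])
    a_w, a_ctx = attend(semantic, hidden, *params["semantic"])
    joint = matvec(params["W_fc"], v_ctx + a_ctx)  # list concatenation, then projection
    return v_w, a_w, joint


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions): 4 image regions, 2 predicted tags, feature dim 3,
# hidden dim 5, attention dim 4, joint context dim 6.
visual = rand_matrix(4, 3)
semantic = rand_matrix(2, 3)
hidden = rand_matrix(1, 5)[0]
params = {
    "visual": (rand_matrix(4, 3), rand_matrix(4, 5), rand_matrix(1, 4)[0]),
    "semantic": (rand_matrix(4, 3), rand_matrix(4, 5), rand_matrix(1, 4)[0]),
    "W_fc": rand_matrix(6, 6),
}

v_w, a_w, joint = co_attention(visual, semantic, hidden, params)
print("visual attention:", [round(w, 3) for w in v_w])
print("semantic attention:", [round(w, 3) for w in a_w])
```

In a full model along the lines the abstract describes, a joint context like this would feed a sentence-level LSTM that proposes topics, and a word-level LSTM would then expand each topic into a sentence. That part is omitted here.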

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An exemplar chest x-ray report. In the impression section, the radiologist provides a diagnosis. The findings section lists the radiology observations regarding each area of the body examined in the imaging study. The tags section lists the keywords which represent the critical information in the findings. These keywords are identified using the Medical Text Indexer (MTI).
  • Figure 2: Illustration of the proposed model. MLC denotes a multi-label classification network. Semantic features are the word embeddings of the predicted tags. The boldfaced tags "calcified granuloma" and "granuloma" are attended by the co-attention network.
  • Figure 3: Illustration of paragraphs generated by the Ours-CoAttention, Ours-no-Attention, and Soft Attention models. The underlined sentences are the descriptions of detected abnormalities. The second image is a lateral x-ray image. The top two images are positive results, the third is a partial failure case, and the bottom one is a failure case. These images are from the test dataset.
  • Figure 4: Visualization of co-attention for three examples. Each example comprises four parts: (1) image and visual attentions; (2) ground truth tags and semantic attention on predicted tags; (3) generated descriptions; (4) ground truth descriptions. For the semantic attention, the three tags with the highest attention scores are highlighted. The underlined tags are those appearing in the ground truth.