Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Hong Liu, Dong Wei, Qiong Peng, Yawen Huang, Xian Wu, Yefeng Zheng, Liansheng Wang

TL;DR

This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG, featuring effective structure-wise image-text contrasting and a dynamic, diversity-enhanced negative queue that guides the network in learning to discriminate various abnormalities.

Abstract

Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to the larger data volumes of CT images and the more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG, featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observes the corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report using a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and that its components are effective.
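
The idea of soft pseudo targets built from text-text similarity can be illustrated with a small sketch. This is not the authors' implementation: the mixing weight `alpha`, the temperatures `tau` and `tau_text`, the function names, and the toy feature vectors are all assumptions made for illustration, and only the image-to-text direction for a single anatomical structure is shown. The sketch blends one-hot targets with a softmax over text-text cosine similarities, so that a non-paired report with an identical description is no longer treated as a pure negative.

```python
import math

def normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def cosine_matrix(a, b):
    # Pairwise cosine similarity between every vector in a and every vector in b.
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def soft_targets(text_feats, alpha=0.5, tau_text=0.1):
    # Assumed blend: (1 - alpha) * one-hot + alpha * softmax(text-text similarity).
    sim = cosine_matrix(text_feats, text_feats)
    n = len(text_feats)
    targets = []
    for i in range(n):
        soft = softmax([s / tau_text for s in sim[i]])
        targets.append([(1 - alpha) * (1.0 if i == j else 0.0) + alpha * soft[j]
                        for j in range(n)])
    return targets

def structure_contrastive_loss(image_tokens, text_feats, targets, tau=0.07):
    # Cross-entropy between the soft targets and the image-to-text softmax.
    sim = cosine_matrix(image_tokens, text_feats)
    loss = 0.0
    for i, row in enumerate(sim):
        probs = softmax([s / tau for s in row])
        loss -= sum(t * math.log(p + 1e-12) for t, p in zip(targets[i], probs))
    return loss / len(sim)

# Toy batch for one structure: reports 0 and 2 carry the same description,
# so (image 0, text 2) would be a false negative under one-hot targets.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
image_tokens = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

targets = soft_targets(text_feats)
loss = structure_contrastive_loss(image_tokens, text_feats, targets)
print(targets[0], loss)
```

In this toy batch, the target row for sample 0 assigns noticeable weight to text 2 (the semantically identical report) and almost none to text 1, which is the behavior the soft pseudo targets are meant to provide. Setting `alpha=0.0` recovers ordinary one-hot contrastive targets.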

Paper Structure (9 sections, 6 equations, 3 figures, 5 tables)

This paper contains 9 sections, 6 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: An example CT image and corresponding report demonstrating highly structured text descriptions and visual correspondences.
  • Figure 2: Overview of our proposed framework.
  • Figure 3: Ground truth and example reports generated by PromptMRG (Jin et al., 2024) and our method for a case in CT-RATE (Hamamci et al., 2024). Colored text and color- and order-matched boxes in the CT slices indicate clinically relevant findings in the ground truth. Our method not only produces sentences with fine details but also identifies more observations than PromptMRG. CT slices are shown with varying windows to highlight lesions.