Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

Weiwei Cao, Jianpeng Zhang, Yingda Xia, Tony C. W. Mok, Zi Li, Xianghua Ye, Le Lu, Jian Zheng, Yuxing Tang, Ling Zhang

TL;DR

This work tackles the lack of large annotated chest CT datasets by bootstrapping 3D CT understanding through language supervision and cross-modal distillation from a 2D chest X-ray expert model. It introduces a language-guided retrieval mechanism to pair CT images with semantically similar X-ray references, enabling cross-modal knowledge transfer even without paired CT-XR data. A robust RoCo objective ($L_{RoCo}$) plus dual distillation of pairwise and semantic relations ($h^{CT}, h^{*}, p^{CT}, p^{*}$) aligns CT images and radiology reports, while entity-focused masking (EFM) emphasizes critical clinical terms. On ChestCT-16K and ChestCT-EXT, BIUD demonstrates strong zero-shot, report generation, and fine-tuning performance, approaching radiologist capabilities in some tasks and highlighting the potential for annotation-free, language-guided medical image understanding in 3D imaging.
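
The language-guided retrieval step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each CT and each X-ray comes with a report embedding and that "semantically similar" means highest cosine similarity between those embeddings; the function names, the toy vectors, and the choice of cosine similarity are assumptions and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_xray(ct_report_emb, xray_report_embs):
    """Return (index, similarity) of the X-ray whose report embedding
    is closest to the CT report embedding (hypothetical helper)."""
    sims = [cosine(ct_report_emb, x) for x in xray_report_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Toy, made-up 3-d report embeddings (hypothetical values).
ct_report = [0.9, 0.1, 0.0]
xray_bank = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]
idx, sim = retrieve_xray(ct_report, xray_bank)
print(idx, round(sim, 3))
```

In this toy bank the second X-ray (index 1) is retrieved, and it would then serve as the reference whose expert-model features are distilled into the CT encoder.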

Abstract

Radiologists highly desire fully automated, versatile AI for medical imaging interpretation. However, the lack of extensively annotated, large-scale, multi-disease datasets has hindered progress toward this goal. In this paper, we explore the feasibility of leveraging language as a naturally high-quality source of supervision for chest CT imaging. Given the limited availability of image-report pairs, we bootstrap the understanding of 3D chest CT images by distilling chest-related diagnostic knowledge from an extensively pre-trained 2D X-ray expert model. Specifically, we propose a language-guided retrieval method to match each 3D CT image with its semantically closest 2D X-ray image, and perform pair-wise and semantic relation knowledge distillation. Subsequently, we use contrastive learning to align images and reports from the same patient while distinguishing them from those of other patients. A challenge arises when different patients have semantically similar diagnoses (for example, healthy patients): treating such pairs as negatives can confuse the model. We therefore introduce a robust contrastive learning method that identifies and corrects these false negatives. We train our model with over 12,000 pairs of chest CT images and radiology reports. Extensive experiments across multiple scenarios, including zero-shot learning, report generation, and fine-tuning, demonstrate the model's feasibility in interpreting chest CT images.
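
The false-negative issue can be made concrete with a small sketch. The abstract does not give the exact form of the robust contrastive objective, so the version below is only one plausible variant, not the paper's definition: reports whose text-to-text similarity exceeds a threshold are treated as extra positives inside an InfoNCE-style image-to-text loss. The function name, the threshold, the temperature, and all similarity values are hypothetical.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def robust_info_nce(sim_img_txt, sim_txt_txt, threshold=0.9, temperature=0.1):
    """Image-to-text contrastive loss in which near-duplicate reports
    count as positives instead of negatives (illustrative variant).

    sim_img_txt[i][j]: similarity between image i and report j.
    sim_txt_txt[i][j]: similarity between report i and report j.
    """
    n = len(sim_img_txt)
    total = 0.0
    for i in range(n):
        # Positives: the patient's own report plus any report judged
        # semantically equivalent (the would-be false negatives).
        positives = [j for j in range(n)
                     if j == i or sim_txt_txt[i][j] >= threshold]
        logits = [s / temperature for s in sim_img_txt[i]]
        log_pos = _logsumexp([logits[j] for j in positives])
        total += -(log_pos - _logsumexp(logits))
    return total / n

# Toy batch: patients 0 and 2 both have "healthy lungs" reports.
sim_img_txt = [[0.9, 0.1, 0.8],
               [0.1, 0.9, 0.1],
               [0.8, 0.1, 0.9]]
sim_txt_txt = [[1.0, 0.2, 0.95],
               [0.2, 1.0, 0.2],
               [0.95, 0.2, 1.0]]

# A threshold above 1 disables the correction (plain InfoNCE).
standard = robust_info_nce(sim_img_txt, sim_txt_txt, threshold=2.0)
robust = robust_info_nce(sim_img_txt, sim_txt_txt, threshold=0.9)
print(standard, robust)
```

In this toy batch the plain loss penalizes image 0 for being close to report 2 even though both reports describe healthy lungs, whereas the corrected variant does not, so `robust` comes out smaller than `standard`.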

Paper Structure (23 sections, 2 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Models tailored to specific diseases require doctors to annotate each image. Creating a multi-disease model requires even more time and effort for comprehensive data annotation. In contrast, our model learns to diagnose various diseases from both images and reports, eliminating the need for additional annotations.
  • Figure 2: Illustration of distilling knowledge from the X-ray expert model to the CT image encoder.
  • Figure 3: Framework of CT image-report alignment.
  • Figure 4: The left illustrates the positive pairs, negative pairs, and false negative pairs in robust contrastive learning. The right shows the images and reports of the second, sixth, and ninth samples. These reports all indicate that the patient's lungs are healthy and without abnormalities.
  • Figure 5: Violin plots illustrating the distribution of zero-shot classification probabilities obtained by four models, CLIP, BLIP, DCL, and our proposed BIUD, across six tasks.
  • ...and 3 more figures