
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports

Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, Xiaoying Tang

TL;DR

Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings.

Abstract

Contrastive learning based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to interchange information across modalities in the training phase by reconstructing masked images and reports. For reconstructing long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR.
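The global alignment mentioned in the abstract can be illustrated with a symmetric contrastive (InfoNCE-style) loss over paired image and report embeddings. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the cosine-similarity logits, and the temperature value of 0.1 are all illustrative choices that do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_embs, text_embs, temperature=0.1):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average of the image-to-text and text-to-image losses for a batch of pairs."""
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))
```

In this sketch, the k-th image and the k-th report in a batch form the positive pair, and every other combination in the batch acts as a negative. The loss is small when each image embedding is closest to its own report embedding.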

Paper Structure (37 sections, 16 equations, 10 figures, 9 tables)

This paper contains 37 sections, 16 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Illustration of the proposed sentence-wise prototype memory bank. The prototype embedding can group sentences sharing similar information. Each sentence representation is updated to the nearest prototype after querying.
  • Figure 2: The overall framework of the proposed PRIOR. Given a paired medical image and report, two independent encoders first encode each modality into a common embedding space. Then, the cross-modality alignment module aligns both global and local information between the two modalities. Finally, the cross-modality conditional reconstruction module reconstructs the masked image given the report and generates the sentence prototypes given the image.
  • Figure 3: Partial fine-tuning results on 1% CheXpert. The number of blocks for fine-tuning increases from left to right. Fine-tuning with 0 blocks is equivalent to linear evaluation, while fine-tuning with 4 blocks is equivalent to full fine-tuning.
  • Figure 4: Representative cross-modality attention maps. (a) The related sentence is "increased bibasilar opacities are combination of increased bilateral pleural effusions and bibasilar atelectasis". (b) The related sentence is "unchanged normal size of the cardiac silhouette".
  • Figure 5: t-SNE visualization of the high-level embeddings from the last layer of the image encoder on CheXpert 5x200.
  • ...and 5 more figures
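
The querying step described in the Figure 1 caption, where each sentence representation is replaced by its nearest prototype, can be sketched as a nearest-neighbour lookup. This is a hedged illustration only: the function names are hypothetical, and the use of cosine similarity as the distance measure is an assumption, since the summary above does not state which metric or update rule the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_prototypes(sentence_embs, prototypes):
    """Map each sentence embedding to its most similar prototype.

    Returns the chosen prototype index per sentence and the replaced
    ("quantized") sentence representations.
    """
    indices, quantized = [], []
    for emb in sentence_embs:
        sims = [cosine_similarity(emb, p) for p in prototypes]
        best = max(range(len(prototypes)), key=sims.__getitem__)
        indices.append(best)
        quantized.append(list(prototypes[best]))  # copy, so the bank is not aliased
    return indices, quantized


# Two toy prototypes and three toy sentence embeddings (made-up values).
bank = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[0.9, 0.1], [0.2, 0.8], [2.0, 0.1]]
indices, quantized = query_prototypes(sentences, bank)
```

Sentences that map to the same index are grouped under one prototype, which matches the grouping behaviour the caption describes. The hard `max` selection used here is not differentiable, so a trainable version would need some additional mechanism that this sketch does not attempt to reproduce.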