Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning

Weijian Huang, Cheng Li, Hong-Yu Zhou, Hao Yang, Jiarun Liu, Yong Liang, Hairong Zheng, Shaoting Zhang, Shanshan Wang

TL;DR

MaCo is a masked contrastive chest X-ray foundation model that achieves fine-grained image understanding and zero-shot learning, outperforming 10 state-of-the-art approaches on 6 open-source X-ray datasets.

Abstract

Recently, multi-modal vision-language foundation models have gained significant attention in the medical field. While these models offer great opportunities, they still face crucial challenges, such as the requirement for fine-grained knowledge understanding in computer-aided diagnosis and the capability of utilizing very limited or even no task-specific labeled data in real-world clinical applications. In this study, we present MaCo, a masked contrastive chest X-ray foundation model that tackles these challenges. MaCo explores masked contrastive learning to simultaneously achieve fine-grained image understanding and zero-shot learning for a variety of medical imaging tasks. It designs a correlation weighting mechanism to adjust the correlation between masked chest X-ray image patches and their corresponding reports, thereby enhancing the model's representation learning capabilities. To evaluate the performance of MaCo, we conducted extensive experiments using 6 well-known open-source X-ray datasets. The experimental results demonstrate the superiority of MaCo over 10 state-of-the-art approaches across tasks such as classification, segmentation, detection, and phrase grounding. These findings highlight the significant potential of MaCo in advancing a wide range of medical image analysis tasks.

Paper Structure

This paper contains 27 sections, 6 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The proposed MaCo framework. (a) An illustration of the masked contrastive learning strategy employed in MaCo, which leverages the advantages of both contrastive learning and pretext tasks. LR denotes the low-resolution image obtained after downsampling, while HR refers to the original high-resolution image. (b) The proposed correlation weighting mechanism: (i) shows the basic structure of MaCo, where image and text representations are compared using a contrastive loss; (ii) presents the procedure used to generate the importance score; and (iii) illustrates how the correlations are built.
  • Figure 2: Qualitative phrase-grounding results when provided with description phrases. We visualize the association of vision and language on the MS-CXR dataset. The description phrases are marked in white font in the image column. The gold standard annotations outlined by clinical experts are represented with dashed boxes.
  • Figure 3: Visualization of the weights of the proposed correlation weighting mechanism. The number under the picture indicates the training epoch. After training, the weights are larger in the central regions with a higher incidence of disease and smaller in the background regions around the edges.
  • Figure 4: Disease-level zero-shot classification performance of different methods on the NIH Chest X-ray dataset.
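
The idea behind Figure 1(b), weighting visible image patches before comparing images with reports under a contrastive loss, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `weighted_masked_contrastive_loss`, the per-position `position_logits`, the softmax pooling, the symmetric InfoNCE form, and the temperature value are all assumptions chosen for illustration, and the paper may compute the importance score and correlations differently.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def _l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def weighted_masked_contrastive_loss(patch_emb, text_emb, keep_mask,
                                     position_logits, temperature=0.07):
    """Hypothetical sketch, not the MaCo code.

    patch_emb:       (B, N, D) patch embeddings from an image encoder
    text_emb:        (B, D) report embeddings from a text encoder
    keep_mask:       (B, N) bool, True where a patch is visible (not masked);
                     each row is assumed to keep at least one patch
    position_logits: (N,) assumed learnable importance logit per patch position
    """
    # Importance weights: softmax over visible patches only (masked -> weight 0).
    logits = np.where(keep_mask, position_logits[None, :], -np.inf)
    weights = np.exp(logits - _logsumexp(logits, axis=1))          # (B, N)

    # Weighted pooling of patch embeddings into one image embedding.
    image_emb = _l2_normalize((weights[..., None] * patch_emb).sum(axis=1))
    text_emb = _l2_normalize(text_emb)

    # Symmetric InfoNCE: matched image-report pairs lie on the diagonal.
    sim = image_emb @ text_emb.T / temperature                     # (B, B)
    log_p_i2t = sim - _logsumexp(sim, axis=1)
    log_p_t2i = sim - _logsumexp(sim, axis=0)
    loss = -0.5 * (np.diag(log_p_i2t).mean() + np.diag(log_p_t2i).mean())
    return float(loss), weights

# Toy usage with random data: 4 image-report pairs, 16 patches, 8-dim embeddings.
rng = np.random.default_rng(0)
B, N, D = 4, 16, 8
patch_emb = rng.normal(size=(B, N, D))
text_emb = rng.normal(size=(B, D))
keep_mask = rng.random((B, N)) > 0.75   # keep roughly a quarter of the patches
keep_mask[:, 0] = True                  # guarantee one visible patch per image
loss, weights = weighted_masked_contrastive_loss(
    patch_emb, text_emb, keep_mask, position_logits=np.zeros(N))
```

With all-zero `position_logits` the weights are uniform over the visible patches; in a trained model, one would expect them to concentrate on disease-prone regions, in the spirit of what Figure 3 visualizes.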