Unified Medical Image Pre-training in Language-Guided Common Semantic Space

Xiaoxuan He, Yifan Yang, Xinyang Jiang, Xufang Luo, Haoji Hu, Siyun Zhao, Dongsheng Li, Yuqing Yang, Lili Qiu

TL;DR

UniMedI introduces a language-guided unified Vision-Language Pre-training framework to fuse 2D X-ray and 3D CT medical images into a common semantic space driven by radiology reports. It achieves this with an attentive slice-selection mechanism that creates pseudo-pairs between 2D and 3D data and a VL contrastive objective, augmented by self-distillation to enhance cross-dimension interactions. The approach is evaluated on ten datasets spanning classification, segmentation, and retrieval tasks, showing consistent gains over specialized baselines and demonstrating data-efficient, universal visual representations. By leveraging diagnostic reports as a cross-modality bridge, UniMedI advances practical VLP for diverse medical imaging applications with potential for broader clinical impact.
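The summary above names a VL contrastive objective without spelling it out. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over an image-report similarity matrix; the function names, the temperature value, and the toy matrices are hypothetical and are not taken from the paper.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean cross-entropy for matching row i to column i of a similarity matrix.

    Assumption: entry sim[i][j] is the similarity between image i and report j,
    and the true pairs lie on the diagonal.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n

def symmetric_contrastive_loss(sim, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    sim_t = [list(col) for col in zip(*sim)]  # transpose: rows become reports
    return 0.5 * (info_nce(sim, temperature) + info_nce(sim_t, temperature))

# Toy similarity matrices (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each image is closest to its own report
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # each image is closest to the wrong report

loss_aligned = symmetric_contrastive_loss(aligned)
loss_shuffled = symmetric_contrastive_loss(shuffled)
```

In this toy setting the loss is small when matched pairs are the most similar and large when they are not, which is the behaviour a contrastive objective is meant to reward.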

Abstract

Vision-Language Pre-training (VLP) has shown its merits in analysing medical images by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn facilitate enhanced analysis and interpretation of intricate imaging data. However, this observation has been justified predominantly on single-modality data (mostly 2D images such as X-rays); adapting VLP to learn unified representations for medical images in real-world scenarios remains an open challenge. The difficulty arises because medical images often encompass a variety of modalities, especially modalities with different numbers of dimensions (e.g., 3D images such as Computed Tomography). To overcome these challenges, we propose a Unified Medical Image Pre-training framework, namely UniMedI, which uses diagnostic reports as a common semantic space to create unified representations for diverse modalities of medical images (especially 2D and 3D images). Under the text's guidance, we effectively uncover visual modality information, identifying the affected areas in 2D X-rays and the slices containing lesions in sophisticated 3D CT scans, ultimately enhancing the consistency across various medical imaging modalities. To demonstrate the effectiveness and versatility of UniMedI, we evaluate its performance on both 2D and 3D images across 10 different datasets, covering a wide range of medical image tasks such as classification, segmentation, and retrieval. UniMedI demonstrates superior performance in downstream tasks, showcasing its effectiveness in establishing a universal medical visual representation.
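To make the idea of text-guided slice selection concrete, here is a minimal sketch that ranks the slices of a 3D volume by their similarity to a report embedding and keeps the top k. It is a simplified stand-in, not the paper's attentive mechanism: cosine similarity as the score, the helper names `cosine` and `select_slices`, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_slices(slice_embeddings, report_embedding, k):
    """Return the indices of the k slices most similar to the report.

    Assumption: each slice and the report are already embedded in the same
    space. Indices are returned in ascending order to preserve slice order.
    """
    scores = [cosine(s, report_embedding) for s in slice_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores

# Toy example: four slice embeddings and one report embedding (hypothetical).
slices = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
report = [0.0, 1.0]

selected, scores = select_slices(slices, report, k=2)
```

The selected slices could then be treated as 2D images paired with the same report, which is one way to read the "pseudo-pairs" between 2D and 3D data described in the summary.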

Paper Structure (36 sections, 2 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Observations that motivate the language-guided strategy for integrating 2D and 3D medical images in VLP.
  • Figure 2: t-SNE visualizations of image representations from models trained with different VLP methods (2D: X-rays, 3D: CT). Both modalities cover the same disease, i.e., pneumonia, and the differences among sub-figures are highlighted with circles. (a) Two models, one per image modality, are trained individually in separate VLP processes. (b) One model for both image modalities is trained in a single VLP process, but without the designs in UniMedI. (c) UniMedI, introducing pseudo-paired '2D' and 3D images in one unified framework.
  • Figure 3: Illustration for the proposed UniMedI framework. The overall pipeline is shown in the left part, and key designs are displayed in the right part.
  • Figure 4: Illustration for the attentive slice selection strategy that creates pseudo-pairs for 2D and 3D medical images.
  • Figure 5: Visualization of masking and slice selection results under the guidance of language.
  • ...and 2 more figures