Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training

Ameera Bawazir, Kebin Wu, Wenbin Li

TL;DR

Uni-Mlip is a unified self-supervision framework designed to enhance medical vision-language pre-training. It significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).

Abstract

Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal data is often costly and challenging due to privacy, sensitivity, and annotation complexity. To mitigate data scarcity while boosting model performance, we introduce Uni-Mlip, a unified self-supervision framework specifically designed to enhance medical vision-language pre-training. Uni-Mlip seamlessly integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data level and the feature level. Additionally, Uni-Mlip tailors uni-modal image self-supervision to accommodate the unique characteristics of medical images. Our experiments across datasets of varying scales demonstrate that Uni-Mlip significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Illustration of the proposed Uni-Mlip model architecture. Colors indicate modality-specific components: orange for image, blue for text, and green for multimodal components.
  • Figure 2: Evaluation results on the Image-to-Text (left) and Text-to-Image (right) retrieval tasks on the ROCO dataset.