MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration

Chenran Zhang, Ruiqi Wu, Tao Zhou, Yi Zhou

TL;DR

MedKCO is a knowledge-driven cognitive orchestration for medical VLP that covers both the ordering of the pretraining data and the vision-language contrastive learning objective, and introduces a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective.

Abstract

Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that covers both the ordering of the pretraining data and the vision-language contrastive learning objective. Specifically, we design a two-level curriculum that orders the pretraining data by diagnostic sensitivity and intra-class sample representativeness. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios across multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. Code is available at https://github.com/Mr-Talon/MedKCO.

Paper Structure

This paper contains 25 sections, 12 equations, 5 figures, 9 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Motivations of knowledge-driven cognitive orchestration. (a) Diagnostic sensitivity varies among different diseases. (b) Representativeness of intra-class samples exhibits variation. (c) High inter-class similarity in medical images at the beginning of the pretraining.
  • Figure 2: Overview of the two-level curriculum. The pretraining data is divided into two distinct levels, label-level (left) and description-level (right). The label-level data is categorized into three stages according to the sensitivity of each modality in detecting specific diseases. The description-level data is clustered into the most relevant categories based on their textual descriptions. Subsequently, within each category, samples are divided into multiple stages according to the representativeness of their image features.
  • Figure 3: Visualization of image features at different stages on the ODIR200×3 dataset under the CLIP framework.
  • Figure 4: Efficiency of curriculum learning. The proposed method has five epochs in each stage. For the baseline method, one stage represents one epoch.
  • Figure 5: Visualization of text features at different stages on the ODIR200×3 dataset under the CLIP framework.
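
The description-level staging outlined in the Figure 2 caption (group samples by category, then split each category into stages by the representativeness of their image features) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that representativeness is measured as cosine similarity to the category's mean feature vector and that the most representative samples are scheduled first. The function names `cosine` and `assign_stages` and the `num_stages` parameter are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_stages(features, categories, num_stages=3):
    """Return a curriculum stage index (0 = earliest) for every sample.

    Assumption (not from the paper): a sample's representativeness is its
    cosine similarity to the centroid of its category, and more
    representative samples are placed in earlier stages.
    """
    # Group sample indices by their assigned category.
    by_category = defaultdict(list)
    for idx, cat in enumerate(categories):
        by_category[cat].append(idx)

    stages = [0] * len(features)
    for idxs in by_category.values():
        dim = len(features[idxs[0]])
        # Mean image feature of the category.
        centroid = [sum(features[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Rank samples from most to least similar to the centroid.
        ranked = sorted(idxs, key=lambda i: cosine(features[i], centroid), reverse=True)
        # Split the ranking into num_stages roughly equal consecutive chunks.
        for rank, i in enumerate(ranked):
            stages[i] = rank * num_stages // len(ranked)
    return stages


# Toy usage: three samples in category "a", one in category "b".
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
categories = ["a", "a", "a", "b"]
stages = assign_stages(features, categories, num_stages=3)
```

Ranking within each category, rather than globally, keeps every category present from the first stage onward. Whether the actual method balances stages this way is not stated in the caption.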