EI: Early Intervention for Multimodal Imaging based Disease Recognition

Qijie Wei, Hailan Lin, Xirong Li

Abstract

Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing "fusion after unimodal image embedding" paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address these challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality's embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.
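
The early-intervention idea can be illustrated with a toy sketch. It is not the authors' implementation: the projection `W_proj`, the use of the reference encoder's [CLS] embedding as the intervention token, and the projection-free `self_attention` stand-in for a ViT block are all assumptions made for illustration. The point is only that an [INT] token placed in the target sequence before the attention layers changes what the target [CLS] token aggregates.

```python
import numpy as np


def self_attention(tokens):
    """Toy single-head self-attention with no learned projections
    (a hypothetical stand-in for one ViT block)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def build_target_sequence(cls_token, patch_tokens, int_token=None):
    """Assemble [CLS], optional [INT], then patch tokens for the target modality."""
    parts = [cls_token[None, :]]
    if int_token is not None:
        parts.append(int_token[None, :])
    parts.append(patch_tokens)
    return np.concatenate(parts, axis=0)


rng = np.random.default_rng(0)
d = 8
cls_token = rng.normal(size=d)           # target [CLS] token
patch_tokens = rng.normal(size=(16, d))  # target patch embeddings

# Assumption: the reference encoder's high-level token is linearly
# projected into the target token space to form the [INT] token.
ref_cls = rng.normal(size=d)
W_proj = rng.normal(scale=0.5, size=(d, d))
int_token = ref_cls @ W_proj

seq_plain = build_target_sequence(cls_token, patch_tokens)
seq_ei = build_target_sequence(cls_token, patch_tokens, int_token)

cls_plain = self_attention(seq_plain)[0]  # [CLS] output without intervention
cls_ei = self_attention(seq_ei)[0]        # [CLS] output with [INT] present
```

Because [INT] joins the sequence before any attention is computed, it can influence every later layer of the target encoder, in contrast to fusing the two modalities only after each has been embedded separately.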

Paper Structure (19 sections, 8 equations, 2 figures, 8 tables, 1 algorithm)

Figures (2)

  • Figure 1: Multimodal medical images and their patch-level similarity maps w.r.t. the [CLS] token. VFM: DINOv2. As [CLS] is used for classification, such maps reflect patch-wise contributions to the final prediction. Per target modality (say CFP), the inclusion of the [INT] token from its reference modality (say OCT) leads to more lesion-focused maps. Best viewed in color.
  • Figure 2: Proposed Mixture of Low-varied-Ranks Adaptation (MoR) method for parameter-efficient VFM adaptation. Compared to LoRA and LoRAMoE, MoR has two novel designs: 1) multiple LoRAs with distinct ranks instead of a single fixed rank, and 2) a relaxed router with a bypass to adaptively accept or reject the adaptation per instance.
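
The two MoR designs named in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the class name `MoRLinearSketch`, the ranks `(2, 4, 8)`, and the reading of the "relaxed router with a bypass" as a softmax over K+1 slots (K LoRA experts plus one bypass slot that contributes nothing) are all hypothetical choices.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class MoRLinearSketch:
    """Frozen linear layer plus LoRA experts of different ranks.

    Assumption: the router emits K+1 weights per instance; the last
    slot is a bypass, so the K LoRA weights sum to less than 1.
    """

    def __init__(self, d_in, d_out, ranks=(2, 4, 8), seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for the frozen pretrained weight of the VFM layer
        self.W = rng.normal(scale=0.02, size=(d_in, d_out))
        # One (A, B) pair per rank; B starts at zero, as in standard LoRA,
        # so the adapted layer initially matches the frozen one
        self.A = [rng.normal(scale=0.02, size=(d_in, r)) for r in ranks]
        self.B = [np.zeros((r, d_out)) for r in ranks]
        # Router logits: one per LoRA expert plus one bypass slot
        self.W_router = rng.normal(scale=0.02, size=(d_in, len(ranks) + 1))

    def forward(self, x):
        gates = softmax(x @ self.W_router)  # shape (batch, K + 1)
        out = x @ self.W                    # frozen path
        for k, (A, B) in enumerate(zip(self.A, self.B)):
            # Per-instance weighting of the rank-r_k low-rank update
            out = out + gates[:, k:k + 1] * (x @ A @ B)
        return out, gates


layer = MoRLinearSketch(d_in=16, d_out=16)
x = np.random.default_rng(1).normal(size=(4, 16))
y, gates = layer.forward(x)
```

Under this reading, an instance whose router mass falls mostly on the bypass slot passes through the frozen layer nearly unchanged, which is one way to "reject" the adaptation per instance.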