Unified Multi-modal Diagnostic Framework with Reconstruction Pre-training and Heterogeneity-combat Tuning

Yupei Zhang, Li Pan, Qiushi Yang, Tan Li, Zhen Chen

TL;DR

The paper addresses the limited transfer of knowledge from unlabeled multi-modal medical data to downstream tasks by identifying distribution and modality heterogeneity as key barriers. It introduces UMD, a two-stage framework featuring MR-Pretrain, which combines data-level and feature-level reconstruction (with losses $L_{\rm MIM}$, $L_{\rm MLM}$, $L_{\rm FeaMIM}$, $L_{\rm FeaMLM}$, and $L_{\rm ITM}$, balanced by $\alpha$), and heterogeneity-combat downstream tuning that includes TD-Calib and GM-Coord to align with downstream distributions and coordinate multi-modal optimization. Extensive experiments on five public medical datasets for VQA, image-text retrieval, and image-text classification demonstrate that UMD outperforms state-of-the-art approaches, with ablations confirming the complementary value of MR-Pretrain, TD-Calib, and GM-Coord. The work provides a principled path to leverage unlabeled medical data for robust, multi-modal diagnostic performance, enabling practical improvements in clinical decision support. The framework’s emphasis on high-level semantic feature learning and dynamic modality balancing offers a significant step toward scalable, distribution-aware medical AI.

Abstract

Medical multi-modal pre-training has shown promise in computer-aided diagnosis by leveraging large-scale unlabeled datasets. However, existing methods based on masked autoencoders rely mainly on data-level reconstruction tasks and lack high-level semantic information. Furthermore, two significant heterogeneity challenges hinder the transfer of pre-trained knowledge to downstream tasks, i.e., the distribution heterogeneity between pre-training data and downstream data, and the modality heterogeneity within downstream data. To address these challenges, we propose a Unified Medical Multi-modal Diagnostic (UMD) framework with tailored pre-training and downstream tuning strategies. Specifically, to enhance the representation abilities of the vision and language encoders, we propose the Multi-level Reconstruction Pre-training (MR-Pretrain) strategy, which combines feature-level and data-level reconstruction and guides models to capture semantic information from masked inputs of different modalities. Moreover, to tackle the two kinds of heterogeneity during downstream tuning, we present the heterogeneity-combat downstream tuning strategy, which consists of a Task-oriented Distribution Calibration (TD-Calib) and a Gradient-guided Modality Coordination (GM-Coord). In particular, TD-Calib fine-tunes the pre-trained model with respect to the distribution of the downstream datasets, and GM-Coord adjusts the gradient weights according to the dynamic optimization status of the different modalities. Extensive experiments on five public medical datasets demonstrate the effectiveness of our UMD framework, which outperforms existing approaches on three kinds of downstream tasks.
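The idea behind GM-Coord, scaling each modality's gradient according to how far ahead it is in optimization, can be illustrated with a minimal sketch. This is not the paper's formula: the `tanh`-based coefficient, the per-modality confidence scores, the `gamma` hyper-parameter, and the function names are all assumptions chosen for illustration. The only detail taken from the source is the case $\rho^T > 1$, in which the gradient of the language modality is the one that gets modulated.

```python
import math


def modulation_coefficients(score_text, score_image, gamma=1.0):
    """Return hypothetical (k_text, k_image) gradient scaling factors.

    score_text / score_image are assumed positive per-modality confidence
    scores (e.g. mean probability assigned to the correct class).
    rho_text = score_text / score_image plays the role of rho^T: when it
    exceeds 1, the text branch is ahead and its gradient is damped, while
    the image branch is left untouched (and symmetrically otherwise).
    """
    eps = 1e-8  # avoids division by zero
    rho_text = score_text / (score_image + eps)
    rho_image = score_image / (score_text + eps)

    # Assumed damping rule: 1 - tanh(gamma * (rho - 1)) for the leading modality.
    k_text = 1.0 - math.tanh(gamma * (rho_text - 1.0)) if rho_text > 1.0 else 1.0
    k_image = 1.0 - math.tanh(gamma * (rho_image - 1.0)) if rho_image > 1.0 else 1.0
    return k_text, k_image


def scale_gradients(grads, k):
    """Scale a flat list of gradient values by the coefficient k."""
    return [g * k for g in grads]


# Text branch is twice as confident as the image branch, so rho^T > 1.
k_text, k_image = modulation_coefficients(score_text=0.8, score_image=0.4)
text_grads = scale_gradients([0.5, -0.2], k_text)    # damped
image_grads = scale_gradients([0.1, 0.3], k_image)   # unchanged
```

In a real training loop the coefficients would be recomputed at every step from the current batch, so the damping follows the changing balance between modalities rather than being fixed in advance.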

Paper Structure

This paper contains 31 sections, 13 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: The comparison of pre-training strategies. Different from the existing methods (a) that aim for data-level reconstruction, we design a novel multi-level reconstruction pre-training (b) that enhances the encoder to learn transferable semantic features by incorporating data-level and feature-level reconstruction.
  • Figure 2: The comparison of fine-tuning strategies. Different from existing methods (a) and (b) that directly fine-tune the entire network or the final linear layer, we design a novel heterogeneity-combat downstream tuning (c) that promotes the encoder to learn semantic features of downstream data with reconstruction and boosts various downstream tasks.
  • Figure 3: Our MR-Pretrain exploits generalizable features from a large-scale unlabeled pre-training dataset in a dual-stream workflow. Besides the data-level reconstruction, we perform a feature-level reconstruction pretext task to encourage transferable representation learning.
  • Figure 4: Our heterogeneity-combat tuning facilitates medical diagnosis on downstream datasets. (a) TD-Calib first calibrates the student multi-modal encoder to bridge the distribution gap, and then (b) GM-Coord performs supervised fine-tuning to balance the modality optimization. For ease of understanding, we elaborate on the case of $\rho^T > 1$, where the gradient of the language modality should be modulated, as shown in (b).
  • Figure 5: Ablation study on the hyper-parameter $\alpha$ in MR-Pretrain. Our UMD framework achieves the best performance when $\alpha$ is set to $0.5$.
  • ...and 3 more figures