Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Jinxia Yang, Bing Su, Wayne Xin Zhao, Ji-Rong Wen

TL;DR

This paper tackles the underutilization of spatio-temporal supervision signals in medical vision-language pre-training. It introduces Med-ST, which uses a multi-view image encoder based on the Mixture of View Expert (MoVE) architecture for frontal and lateral chest X-ray views, coupled with global image-text alignment and modality-weighted local alignment, and adds a cross-modal bidirectional cycle consistency objective for temporal modeling. The training objective combines $L_{global}$, $L_{local}$, and $L_{temporal}$ into a unified loss $L = L_{global} + \lambda_1 L_{local} + \lambda_2 L_{temporal}$ to learn fine-grained spatio-temporal representations. Experiments on MS-CXR-T, RSNA, and COVIDx show Med-ST achieving superior temporal classification, improved sentence similarity, and robust zero-shot performance, highlighting the practical potential for clinical decision support. Code and model are publicly released to facilitate adoption and further research.
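
As a rough illustration of how the three terms in the unified loss combine, the sketch below computes the weighted sum for scalar loss values. The function name `total_loss` and the default weights are assumptions made for this example; the summary does not state the values the authors use for the two weighting coefficients.

```python
def total_loss(l_global, l_local, l_temporal, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the three objectives named in the TL;DR.

    lambda1 and lambda2 are hypothetical defaults, not values from the paper.
    """
    return l_global + lambda1 * l_local + lambda2 * l_temporal


# Example with made-up loss values: the local term is down-weighted and the
# temporal term is up-weighted.
combined = total_loss(1.0, 0.5, 0.25, lambda1=0.5, lambda2=2.0)
print(combined)  # 1.75
```

Setting either weight to zero removes the corresponding term, which is one simple way to run the kind of ablation that isolates the spatial or temporal component.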

Abstract

Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.
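
The abstract names forward mapping classification (FMC) and reverse mapping regression (RMR) but does not define them, so the following is only a generic sketch of the underlying idea of cross-modal cycle consistency, not the authors' objective. It assumes each sequence is a list of global feature vectors, one per time step; it maps an image feature to a soft nearest neighbor in the text sequence, maps that point back to the image sequence, and penalizes cycles that do not return to the starting time step. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neg_sq_dist(u, v):
    """Negative squared Euclidean distance, used as a similarity score."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def cycle_back_probs(src_seq, tgt_seq, i):
    """Probability of landing on each source time step after a
    source -> target -> source cycle that starts at source step i."""
    # Step 1: soft nearest neighbor of src_seq[i] in the target sequence.
    alpha = softmax([neg_sq_dist(src_seq[i], t) for t in tgt_seq])
    dim = len(tgt_seq[0])
    soft_nn = [sum(a * t[d] for a, t in zip(alpha, tgt_seq)) for d in range(dim)]
    # Step 2: map the soft neighbor back onto the source sequence.
    return softmax([neg_sq_dist(soft_nn, s) for s in src_seq])


def cycle_classification_loss(src_seq, tgt_seq):
    """Mean cross-entropy between the cycled-back distribution and the
    starting time step (a classification view of cycle consistency)."""
    losses = []
    for i in range(len(src_seq)):
        probs = cycle_back_probs(src_seq, tgt_seq, i)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


# Toy sequences of three time steps with 2-D features (made-up values).
image_seq = [[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]]
text_seq = [[0.1, 0.0], [5.1, 0.0], [9.9, 0.0]]

# Applying the loss in both directions gives a bidirectional constraint.
loss = cycle_classification_loss(image_seq, text_seq) + \
    cycle_classification_loss(text_seq, image_seq)
```

When the features at different time steps are well separated and the two modalities agree, the cycle returns to its starting step and the loss is close to zero; if all time steps collapse to the same feature, the cycled-back distribution is uniform and the loss rises to the log of the sequence length.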

Paper Structure

This paper contains 33 sections, 15 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: The motivation and our framework: (a) The practice of physicians in a clinical setting. (b) We propose the Med-ST framework, which explicitly designs spatial and temporal modeling by mining spatio-temporal supervision signals from the dataset.
  • Figure 2: The framework of Med-ST. For spatial modeling, Med-ST extracts integrated multi-view visual representations, performs global alignment between paired global features, and introduces modality-weighted local alignment between paired local features. For temporal modeling, global image and text features in all pairs form two sequences respectively, and Med-ST imposes cross-modality bidirectional cycle consistency constraints between them.
  • Figure 3: The multi-view image encoder. We input both the frontal and lateral views into MoVE blocks. The right side shows the structure of MoVE.
  • Figure 4: Attention visualization with reference patch (red box) in the bounding box (black box). LA stands for local alignment. The bounding box represents the main pathological area annotated manually. It can be observed that our method achieves better correspondence with the bounding box after local alignment.
  • Figure 5: t-SNE visualization results of image features at different time steps for 200 sequences. The left side is from MGCA, and the right side is ours.
  • ...and 1 more figure