Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training
Jinxia Yang, Bing Su, Wayne Xin Zhao, Ji-Rong Wen
TL;DR
This paper tackles the underutilization of spatio-temporal supervision signals in medical vision-language pre-training. It introduces Med-ST, which uses a Mixture of View Expert (MoVE) based multi-view image encoder for frontal and lateral chest X-ray views, coupled with global image-text alignment and modality-weighted local alignment, and adds a cross-modal bidirectional cycle consistency objective for temporal modeling. The training objective combines $L_{global}$, $L_{local}$, and $L_{temporal}$ into a unified loss $L = L_{global} + \lambda_1 L_{local} + \lambda_2 L_{temporal}$ to learn fine-grained spatio-temporal representations. Experiments on MS-CXR-T, RSNA, and COVIDx show Med-ST achieving superior temporal classification, improved sentence similarity, and robust zero-shot performance, highlighting the practical potential for clinical decision support. Code and model are publicly released to facilitate adoption and further research.
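As a small illustration of how the unified loss is assembled, the sketch below combines three already-computed scalar loss terms with the two weights on the local and temporal terms. The function name `total_loss` and the default weight values are placeholders chosen for this example; the summary does not state the values used by the authors.

```python
def total_loss(l_global, l_local, l_temporal, lambda1=1.0, lambda2=1.0):
    """Weighted sum L = L_global + lambda1 * L_local + lambda2 * L_temporal.

    The default weights are illustrative assumptions, not values from the paper.
    """
    return l_global + lambda1 * l_local + lambda2 * l_temporal


# Example with made-up loss values: 1.0 + 0.5 * 2.0 + 0.25 * 4.0
combined = total_loss(1.0, 2.0, 4.0, lambda1=0.5, lambda2=0.25)
```

Setting either weight to zero removes the corresponding term, which is one simple way to ablate the local or temporal objective in such a setup.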
Abstract
Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.
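To make the idea of cross-modal cycle consistency more concrete, here is a minimal pure-Python sketch in the style of generic temporal cycle-consistency learning. It is not the authors' implementation: the function names, the use of negative squared Euclidean distance as similarity, and the pairing of the classification loss with the regression loss are all assumptions made for illustration. The abstract only states that FMC is a classification objective and RMR a regression objective. In the sketch, an embedding at time step `i` in one modality hops to a soft nearest neighbour in the other modality's sequence and then back; a consistent cycle should return to step `i`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def neg_sq_dist(u, v):
    """Similarity as negative squared Euclidean distance (an assumed choice)."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def soft_nearest_neighbor(query, candidates):
    """Similarity-weighted average of the candidate embeddings."""
    weights = softmax([neg_sq_dist(query, c) for c in candidates])
    return [sum(w * c[d] for w, c in zip(weights, candidates))
            for d in range(len(query))]


def cycle_back_distribution(i, source, target):
    """Hop from source[i] into target, then score every source time step."""
    neighbor = soft_nearest_neighbor(source[i], target)
    return softmax([neg_sq_dist(neighbor, s) for s in source])


def cycle_classification_loss(i, source, target):
    """Cross-entropy for cycling back to the starting index i."""
    return -math.log(cycle_back_distribution(i, source, target)[i])


def cycle_regression_loss(i, source, target):
    """Squared error between the expected return index and i."""
    beta = cycle_back_distribution(i, source, target)
    expected_index = sum(k * b for k, b in enumerate(beta))
    return (expected_index - i) ** 2


# Toy sequences of 2-D embeddings over three time steps (made-up values).
image_seq = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
text_seq = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]

forward_loss = cycle_classification_loss(1, image_seq, text_seq)  # image -> text -> image
reverse_loss = cycle_regression_loss(1, text_seq, image_seq)      # text -> image -> text
```

The classification variant only rewards landing on the correct time step, while the regression variant also penalizes how far in time the cycle lands from the start. That is one plausible reading of perceiving temporal information "from simple to complex".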
