Building Universal Foundation Models for Medical Image Analysis with Spatially Adaptive Networks
Lingxiao Luo, Xuanzhong Chen, Bingda Tang, Xinsheng Chen, Rong Han, Chengpeng Hu, Yujiang Li, Ting Chen
TL;DR
This work tackles the challenge of building universal foundation models for medical image analysis despite substantial spatial heterogeneity across modalities and dimensions. It introduces SPAD-Nets, a family of spatially adaptive networks, and a two-stage pre-training pipeline that trains a SPAD visual tokenizer (SPAD-VT) and then a SPAD Vision Transformer (SPAD-ViT) with masked image modeling on 55 public datasets (~$9.1\text{M}$ slices). The resulting unified architecture adapts to input spacing through SPAD-conv blocks, which lets one model process both 2D and 3D images and perform strongly on downstream classification and segmentation tasks, with notable label-efficiency gains in few-shot settings. Combining large-scale self-supervised pre-training data, soft-token representations, and regularized token usage yields robust token distributions and better generalization across diverse medical imaging tasks, underscoring the practical potential of universal medical foundation models. Overall, SPAD-Nets offer a scalable way to leverage unlabeled medical data across modalities, reducing annotation burden and improving cross-dataset transfer in clinical image analysis.
Abstract
Recent advances in foundation models, typically trained with self-supervised learning on large-scale and diverse datasets, have shown great potential in medical image analysis. However, because medical imaging data is highly heterogeneous in its spatial properties, current models must be structurally tailored to each dataset, which makes it difficult to leverage the abundant unlabeled data. In this work, we propose a universal foundation model for medical image analysis that processes images with heterogeneous spatial properties using a unified structure. To build it, we propose spatially adaptive networks (SPAD-Nets), a family of networks that dynamically adjust their structure to the spatial properties of the input images. We pre-train a spatially adaptive visual tokenizer (SPAD-VT) and then a spatially adaptive Vision Transformer (SPAD-ViT) via masked image modeling (MIM) on 55 public medical image datasets. The pre-training data comprises over 9 million image slices and is, to our knowledge, the largest, most comprehensive, and most diverse dataset used for pre-training universal foundation models for medical image analysis. Experimental results on downstream medical image classification and segmentation tasks demonstrate the superior performance and label efficiency of our model. Our code is available at https://github.com/function2-llx/PUMIT.
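To make the idea of a network that adjusts its structure to the input's spatial properties more concrete, the following is a minimal sketch, not the authors' SPAD-conv implementation. It assumes a simple heuristic: at each downsampling stage, an axis is downsampled by 2 only while its spacing is less than twice the finest spacing. The function name `plan_strides`, the 2x threshold, and the treatment of a 2D image as a single slice with infinite depth spacing are all hypothetical choices made for illustration.

```python
def plan_strides(spacing, num_stages):
    """Pick a per-axis stride (1 or 2) for each downsampling stage.

    `spacing` is the physical voxel spacing per axis, e.g. (depth, height, width).
    The 2x threshold is an illustrative assumption, not taken from the paper.
    """
    spacing = [float(s) for s in spacing]
    plan = []
    for _ in range(num_stages):
        finest = min(spacing)
        # Downsample an axis only while it is less than 2x coarser than the finest axis.
        strides = tuple(2 if s < 2 * finest else 1 for s in spacing)
        # Spacing doubles on every axis that was downsampled at this stage.
        spacing = [s * k for s, k in zip(spacing, strides)]
        plan.append(strides)
    return plan


if __name__ == "__main__":
    # Anisotropic volume: thick slices (5 mm) with fine in-plane resolution (1 mm).
    print(plan_strides((5.0, 1.0, 1.0), 3))
    # Isotropic volume: every axis is downsampled at every stage.
    print(plan_strides((1.0, 1.0, 1.0), 2))
    # A 2D image modeled (by assumption) as one slice with infinite depth spacing.
    print(plan_strides((float("inf"), 1.0, 1.0), 2))
```

Under this heuristic, the same stack of stages behaves like a 2D network on thick-slice or single-slice inputs and like a fully 3D network on isotropic volumes, which is one way a single structure could serve inputs with heterogeneous spacing.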
