Building Universal Foundation Models for Medical Image Analysis with Spatially Adaptive Networks

Lingxiao Luo, Xuanzhong Chen, Bingda Tang, Xinsheng Chen, Rong Han, Chengpeng Hu, Yujiang Li, Ting Chen

TL;DR

This work tackles the challenge of building universal foundation models for medical image analysis despite substantial spatial heterogeneity across modalities and dimensions. It introduces SPAD-Nets, a family of spatially adaptive networks, and a two-stage pre-training pipeline using a SPAD visual tokenizer (SPAD-VT) and a SPAD Vision Transformer (SPAD-ViT) trained with masked image modeling on 55 public datasets (~$9.1\text{M}$ slices). The approach achieves a unified architecture that adapts to input spacing through SPAD-conv blocks, enabling effective 2D/3D processing and strong performance on downstream classification and segmentation tasks, with notable label-efficient gains in few-shot settings. The combination of large-scale SSL data, soft-token representations, and regularized token usage yields robust token distributions and improved generalization across diverse medical imaging tasks, highlighting the practical potential for universal medical foundation models. Overall, SPAD-Nets offer a scalable path to leverage unlabeled medical data across modalities, reducing annotation burdens and enhancing cross-dataset transfer in clinical imaging analytics.

Abstract

Recent advancements in foundation models, typically trained with self-supervised learning on large-scale and diverse datasets, have shown great potential in medical image analysis. However, due to the significant spatial heterogeneity of medical imaging data, current models must tailor specific structures for different datasets, making it challenging to leverage the abundant unlabeled data. In this work, we propose a universal foundation model for medical image analysis that processes images with heterogeneous spatial properties using a unified structure. To accomplish this, we propose spatially adaptive networks (SPAD-Nets), a family of networks that dynamically adjust the structures to adapt to the spatial properties of input images, to build such a universal foundation model. We pre-train a spatial adaptive visual tokenizer (SPAD-VT) and then a spatial adaptive Vision Transformer (SPAD-ViT) via masked image modeling (MIM) on 55 public medical image datasets. The pre-training data comprises over 9 million image slices, representing the largest, most comprehensive, and most diverse dataset to our knowledge for pre-training universal foundation models for medical image analysis. The experimental results on downstream medical image classification and segmentation tasks demonstrate the superior performance and label efficiency of our model. Our code is available at https://github.com/function2-llx/PUMIT.
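The abstract says SPAD-Nets adjust their structure to the spatial properties of the input but does not give the rule. As a rough illustration only, the sketch below picks a per-stage stride along the depth axis from the voxel spacing. The function name `plan_depth_strides`, the factor-of-two threshold, and the treatment of a 2D image as a volume with unbounded depth spacing are all assumptions made for this example. They are not taken from the paper or from the PUMIT code.

```python
from typing import List, Tuple


def plan_depth_strides(
    spacing: Tuple[float, float, float], num_stages: int
) -> List[int]:
    """Choose a depth stride (1 or 2) for each downsampling stage.

    spacing is (depth, height, width) in millimetres. Hypothetical rule:
    the in-plane axes are halved at every stage, and the depth axis is
    halved only while its spacing is less than twice the in-plane spacing.
    """
    depth, height, width = spacing
    in_plane = min(height, width)
    strides = []
    for _ in range(num_stages):
        if depth < 2 * in_plane:
            strides.append(2)  # depth is fine enough: downsample it too
            depth *= 2
        else:
            strides.append(1)  # depth is already coarse: leave it alone
        in_plane *= 2  # in-plane resolution is halved at every stage
    return strides


# Isotropic volume: depth is downsampled at every stage.
print(plan_depth_strides((1.0, 1.0, 1.0), 3))
# Thick-slice volume: depth is left untouched until in-plane catches up.
print(plan_depth_strides((5.0, 1.0, 1.0), 3))
# 2D image modelled as unbounded depth spacing: depth is never downsampled.
print(plan_depth_strides((float("inf"), 1.0, 1.0), 3))
```

Under this assumed rule, a single set of weights could in principle serve isotropic volumes, thick-slice volumes, and 2D images, because only the strides change with the input rather than the layer layout.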

Paper Structure

This paper contains 46 sections, 1 theorem, 17 equations, 6 figures, and 6 tables.

Key Result

Theorem 1

If both $\omega_x$ and $\omega_z$ are non-zero algebraic numbers, and $\frac{\omega_x}{\omega_z}$ is an irrational number, then for any $(t_x^{(1)}, t_z^{(1)}), (t_x^{(2)}, t_z^{(2)}) \in \mathbb{Z}^2$ where $(t_x^{(1)}, t_z^{(1)}) \neq (t_x^{(2)}, t_z^{(2)})$, it holds that $\mathcal{R}_{x, z}(t_x^

Figures (6)

  • Figure 1: The first two stages of the U-Net encoders designed by nnU-Net for Task 1 (left) and Task 5 (right) of the MSD challenge, and the version built with our proposed SPAD-Nets (middle). Convolution parameters ($k$ for kernel size, $s$ for stride) along the depth dimension are indicated. The structures designed by nnU-Net for datasets with different spacing have incompatible parts (marked in red). SPAD-Nets can handle images from both tasks by adapting their structure to the spatial properties of the input.
  • Figure 2: Illustration of our pre-training framework for both SPAD-VT and SPAD-ViT. The models built with our proposed SPAD-Nets can process a wide range of images using a unified model structure. Note that SPAD-VT is trained first (optimizing $L_{\mathrm{VT}}$), and is fixed during the training of SPAD-ViT (optimizing $L_{\mathrm{MIM}}$).
  • Figure 3: Reconstruction results of SPAD-VT on unseen images. The images are taken from different body parts with diverse imaging modalities and spatial properties. For each pair of images, the original image is on the left and the reconstructed image is on the right. First row: (left) chest X-ray image from CheXpert; (middle) fundus photograph from GAMMA; (right) breast ultrasound image from BUSI. Second row: (left) lung CT image from LIDC-IDRI; (middle) abdomen CT image from FLARE 2022; (right) prostate MRI image from PROSTATE-MRI. More examples are available in the appendix.
  • Figure 4: Token utilization histogram. Every eight tokens with consecutive indexes are merged into one rectangle for better visualization within the limited page width.
  • Figure 5: Minimum intervals between rotation angles for different $i$. $i$ is the index of dimensions $d$ of query or key vector $\bm{a}$.
  • ...and 1 more figure
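
The caption of Figure 4 describes merging every eight tokens with consecutive indexes into one rectangle. A minimal sketch of that binning step follows, assuming the token indexes are plain non-negative integers and the codebook size is a multiple of the group size. The function name `merged_token_histogram` and the toy inputs are hypothetical and do not come from the paper's code.

```python
from typing import Iterable, List


def merged_token_histogram(
    token_ids: Iterable[int], codebook_size: int, group: int = 8
) -> List[int]:
    """Count token usage, then merge each run of `group` consecutive indexes."""
    if codebook_size % group != 0:
        raise ValueError("codebook_size must be a multiple of group")
    merged = [0] * (codebook_size // group)
    for token in token_ids:
        if not 0 <= token < codebook_size:
            raise ValueError(f"token index {token} is outside the codebook")
        # Indexes 0..7 fall in bin 0, 8..15 in bin 1, and so on.
        merged[token // group] += 1
    return merged


# Toy example with a 16-entry codebook: two merged bins.
print(merged_token_histogram([0, 1, 7, 8, 15, 15], codebook_size=16))
```

Merging trades per-token detail for a plot that fits the page width, so a bin with a low count can still hide individual tokens that are never used.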

Theorems & Definitions (2)

  • Theorem 1
  • proof