Learning General-Purpose Biomedical Volume Representations using Randomized Synthesis
Neel Dey, Benjamin Billot, Hallee E. Wong, Clinton J. Wang, Mengwei Ren, P. Ellen Grant, Adrian V. Dalca, Polina Golland
TL;DR
The paper addresses the generalization gap in 3D biomedical vision caused by limited public datasets. It introduces a synthetic data engine and a voxel-focused contrastive pretraining framework to produce a generalist 3D network whose features are stable across modalities and appearances, enabling downstream registration and segmentation without real-image pretraining. Key contributions include a label-ensemble data generator from ~45,000 templates, a dual-volume appearance model with Gaussian mixtures, and a multi-positive contrastive loss applied across a four-level 3D UNet with a projection head used only during pretraining. The resulting representations achieve state-of-the-art unsupervised multimodality 3D registration and provide dataset-agnostic initializations for few-shot segmentation, demonstrating strong cross-domain performance and practical impact for diverse radiology tasks.
Abstract
Current volumetric biomedical foundation models struggle to generalize because public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts during training. We first propose a data engine that synthesizes highly variable training samples to enable generalization to new biomedical contexts. Then, to train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against the nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks, and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.
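To make the two ingredients above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes (i) that a synthetic appearance can be drawn by painting each label of a label map with its own randomly parameterized Gaussian, and (ii) that stability to appearance can be encouraged with a supervised-contrastive-style loss in which all feature vectors sharing a group id (for example, the same label seen under different synthetic appearances) are treated as positives. The function names `sample_appearance` and `multi_positive_contrastive_loss`, the intensity ranges, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_appearance(labels, num_labels, rng):
    """Paint a label map with one random Gaussian per label (illustrative assumption)."""
    means = rng.uniform(0.0, 1.0, size=num_labels)    # hypothetical intensity range
    stds = rng.uniform(0.01, 0.1, size=num_labels)    # hypothetical noise range
    noise = rng.standard_normal(labels.shape)
    return means[labels] + stds[labels] * noise


def multi_positive_contrastive_loss(features, groups, temperature=0.1):
    """Contrastive loss where rows with the same group id are positives.

    features: (N, D) array of nonzero feature vectors.
    groups:   (N,) integer array; equal ids mark positive pairs.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    n = len(groups)
    not_self = ~np.eye(n, dtype=bool)
    # Exclude self-similarity from the softmax denominator.
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_prob = logits - row_max - np.log(
        np.exp(logits - row_max).sum(axis=1, keepdims=True)
    )
    positives = (groups[:, None] == groups[None, :]) & not_self
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0  # anchors without any positive are skipped
    sum_pos = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(sum_pos[valid] / pos_counts[valid]).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 4, size=(8, 8, 8))    # toy 3D label map
    view_a = sample_appearance(labels, 4, rng)     # two appearances of the
    view_b = sample_appearance(labels, 4, rng)     # same underlying anatomy
    print(view_a.shape, view_b.shape)
```

In an actual pretraining loop, `features` would be voxel embeddings produced by the network from the two synthesized views, and `groups` would be the labels at the sampled voxels. Here the two pieces are kept separate so each can be checked on its own.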
