Learning General-Purpose Biomedical Volume Representations using Randomized Synthesis

Neel Dey, Benjamin Billot, Hallee E. Wong, Clinton J. Wang, Mengwei Ren, P. Ellen Grant, Adrian V. Dalca, Polina Golland

TL;DR

The paper addresses the generalization gap in 3D biomedical vision caused by limited public datasets. It introduces a synthetic data engine and a voxel-focused contrastive pretraining framework to produce a generalist 3D network whose features are stable across modalities and appearances, enabling downstream registration and segmentation without real-image pretraining. Key contributions include a label-ensemble data generator from ~45,000 templates, a dual-volume appearance model with Gaussian mixtures, and a multi-positive contrastive loss applied across a four-level 3D UNet with a projection head used only during pretraining. The resulting representations achieve state-of-the-art unsupervised multimodality 3D registration and provide dataset-agnostic initializations for few-shot segmentation, demonstrating strong cross-domain performance and practical impact for diverse radiology tasks.
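To make the "dual-volume appearance model with Gaussian mixtures" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a flattened integer label map and draws two intensity volumes from it, resampling one Gaussian (mean, standard deviation) per label for each volume, so that the two outputs share geometry but differ in contrast. The function name `synthesize_pair`, the parameter ranges, and the clamping to [0, 1] are illustrative assumptions; the paper's data engine is richer than this (for example, it also randomly deforms the label templates).

```python
import random


def synthesize_pair(label_map, num_labels, seed=0):
    """Draw two appearance "views" of one flattened label map.

    Hypothetical helper: each view samples its own per-label Gaussian
    parameters, so both views share anatomy but not intensity contrast.
    """
    rng = random.Random(seed)
    views = []
    for _ in range(2):
        # One Gaussian component per label, resampled for every view.
        # The ranges below are assumptions chosen for illustration only.
        means = [rng.uniform(0.0, 1.0) for _ in range(num_labels)]
        stds = [rng.uniform(0.01, 0.1) for _ in range(num_labels)]
        view = []
        for lab in label_map:
            value = rng.gauss(means[lab], stds[lab])
            view.append(min(1.0, max(0.0, value)))  # keep intensities in [0, 1]
        views.append(view)
    return views[0], views[1]


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 2, 2, 1, 0]  # toy stand-in for a 3D label ensemble
    vol_a, vol_b = synthesize_pair(labels, num_labels=3, seed=1)
    print(vol_a)
    print(vol_b)
```

Because both volumes come from the same label map, every voxel has a known correspondence and a known label in both views, which is what the contrastive pretraining objective relies on.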

Abstract

Current volumetric biomedical foundation models struggle to generalize as public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time itself. We first propose a data engine that synthesizes highly variable training samples that would enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.

Paper Structure

This paper contains 32 sections, 1 equation, 12 figures, 12 tables, 1 algorithm.

Figures (12)

  • Figure 1: Representations produced by our framework, trained only on synthetic data, are approximately stable across imaging modalities, fields of view, and poses on real unseen volumes from various datasets. For each anatomical region (rows), we show two example volumes with substantial variation (col. 1) and six arbitrarily selected network output channels (cols. 2–7) that illustrate this stability. These features and network weights can be used for several voxel-level tasks.
  • Figure 2: Data engine. A. We randomly sample binary labels as templates from a large database of anatomical segmentations to create 3D label ensembles of randomly deformed templates. B. Given a synthesized label ensemble and an appearance model, we synthesize two volumes to pretrain a network with a dense multi-view contrastive objective. C. Example synthetic training volumes produced by our data engine. These samples are not intended to be necessarily realistic, but rather to serve as diverse and useful training data for learning general tasks in arbitrary radiological domains.
  • Figure 3: Representation learning. Given a 3D label map and two corresponding synthetic volumes sampled from our data engine, we process them using a single UNet with shared weights. The UNet is pretrained contrastively at each decoder layer: for a randomly sampled anchor, features sampled from the same label in both volumes serve as positives, while features from other labels act as negatives.
  • Figure 4: Multi-modality 3D registration. Using our representations with the ConvexAdam registration solver ("Ours") yields accurate alignment of challenging pairs of intra-subject abdominal MRI-CT (top) and inter-subject cardiac MRI-CT (bottom) with large deformations across modalities.
  • Figure 5: Multi-modality 3D registration results. (a) Dice boxplots for each method for L2RAb (left group) and MM-WHS (right group), with corresponding medians reported on top of each box and the mean percentages of voxels with folds produced by each method reported at the bottom; (b) Using our features leads to consistent registration improvements at the subject-level.
  • ...and 7 more figures
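
The anchor/positive/negative scheme described in the Figure 3 caption can be illustrated with a small sketch. The code below uses a supervised-contrastive-style (multi-positive InfoNCE) formulation, which may differ from the paper's exact loss. It assumes that features have already been sampled at the same voxel locations in both synthetic volumes and that each sampled voxel carries its label from the shared label map. The function name, the temperature value, and the cosine-similarity choice are assumptions made for illustration.

```python
import math


def multi_positive_contrastive_loss(feats_a, feats_b, labels, temperature=0.1):
    """Supervised-contrastive-style loss over voxel features from two views.

    feats_a[i] and feats_b[i] are feature vectors for the same sampled voxel
    in the two volumes; labels[i] is that voxel's label. Hypothetical sketch.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    feats = [normalize(v) for v in list(feats_a) + list(feats_b)]
    labs = list(labels) + list(labels)
    n = len(feats)
    total = 0.0
    for i in range(n):  # every sampled feature acts as an anchor in turn
        # Temperature-scaled cosine similarity to all other samples.
        logits = {
            j: sum(x * y for x, y in zip(feats[i], feats[j])) / temperature
            for j in range(n) if j != i
        }
        # Positives share the anchor's label (in either volume); the same
        # voxel in the other volume is always one of them. The remaining
        # samples act as negatives through the softmax denominator.
        positives = [j for j in logits if labs[j] == labs[i]]
        peak = max(logits.values())  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / n


if __name__ == "__main__":
    feats_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    feats_b = [[0.95, 0.05], [1.0, 0.0], [0.05, 0.95], [0.0, 1.0]]
    print(multi_positive_contrastive_loss(feats_a, feats_b, [0, 0, 1, 1]))
```

In the setting the caption describes, a loss of this kind would be evaluated on features from each decoder layer of the shared-weight UNet; the sketch shows a single layer only. Features that cluster by label across both volumes give a lower value than features that do not.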