
Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It

Sebo Diaz, Polina Golland, Elfar Adalsteinsson, Neel Dey

Abstract

We present DropGen, a simple and theoretically-grounded approach for domain generalization in 3D biomedical image segmentation. Modern segmentation models degrade sharply under shifts in modality, disease severity, clinical sites, and other factors, creating brittle models that limit reliable deployment. Existing domain generalization methods rely on extreme augmentations, mixing domain statistics, or architectural redesigns, yet incur significant implementation overhead and yield inconsistent performance across biomedical settings. DropGen instead proposes a principled learning strategy with minimal overhead that leverages both source-domain image intensities and domain-stable foundation model representations to train robust segmentation models. As a result, DropGen achieves strong gains in both fully supervised and few-shot segmentation across a broad range of shifts in biomedical studies. Unlike prior approaches, DropGen is architecture- and loss-agnostic, compatible with standard augmentation pipelines, computationally lightweight, and tackles arbitrary anatomical regions. Our implementation is freely available at https://github.com/sebodiaz/DropGen.

Paper Structure

This paper contains 20 sections, 2 theorems, 7 equations, 9 figures, 13 tables.

Key Result

Proposition 1 (Stationarity forces use of stable inputs)

Given the stability and informativeness assumptions, let $h_{\theta}$ be a model whose first layer computes $a^{(1)} = \sigma\left( W_{u} \star X_{u} + W_{s} \star X_{s} \right)$, where $W_u, W_s$ denote the first-layer kernel slices corresponding to the unstable and stable input channels, respectively, and $\star$ is convolution. …
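For concreteness, the short PyTorch sketch below illustrates this decomposition, assuming a 3D convolutional first layer and illustrative channel counts (one unstable image-intensity channel, sixteen stable representation channels); these values are assumptions, not taken from the paper. It checks numerically that a convolution over the concatenated input equals the sum of its kernel slices applied to the unstable and stable channel groups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative channel counts (assumptions, not from the paper):
# 1 unstable image-intensity channel, 16 stable representation channels.
C_U, C_S = 1, 16
x_u = torch.randn(2, C_U, 8, 8, 8)  # X_u: source-domain image intensities
x_s = torch.randn(2, C_S, 8, 8, 8)  # X_s: domain-stable foundation features

conv = nn.Conv3d(C_U + C_S, 32, kernel_size=3, padding=1)

# First-layer activation on the concatenated input [X_u; X_s].
a1 = torch.relu(conv(torch.cat([x_u, x_s], dim=1)))

# The same activation written as sigma(W_u * X_u + W_s * X_s), where
# W_u and W_s are the kernel slices over the two channel groups.
W_u, W_s = conv.weight[:, :C_U], conv.weight[:, C_U:]
pre = (F.conv3d(x_u, W_u, padding=1)
       + F.conv3d(x_s, W_s, padding=1)
       + conv.bias.view(1, -1, 1, 1, 1))
assert torch.allclose(a1, torch.relu(pre), atol=1e-5)
```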

Figures (9)

  • Figure 1: Training on Stable Representations. When trained on in-domain CT (A, top) and tested on out-of-domain MRI (A, bottom), standard ERM models produce representations that are unstable under domain shifts (B). Although these models remain performant on unseen in-domain data (E, top), this instability degrades performance on new out-of-distribution images (E, bottom). DropGen instead jointly trains on both in-domain image intensities (A, top) and stable, domain-invariant representations extracted by foundation models (C, top) and regularizes this combination. Grounded and motivated by our theoretical analyses, this framework enables training robust segmentors that generalize to new domains automatically, without any adaptation (F, bottom). Representations were visualized by mapping 3 arbitrary channels to RGB.
  • Figure 2: Method overview. Left: Given a standard PyTorch training loop, the green lines are the only additions required, demonstrating DropGen's simplicity (a hedged sketch of such a loop follows this figure list). Right: The probabilistic graphical model we use for domain generalization. Label $Y$ generates both stable $X_s$ and unstable $X_u$ variables, and the environment $E$ influences only $X_u$.
  • Figure 3: Qualitative all-data segmentation results. The rows correspond to different datasets and domain shift types. Column 1 visualizes a representative in-domain training sample, and columns 3--7 illustrate qualitative results on out-of-domain test set examples. The red arrows highlight regions with incorrect predictions.
  • Figure 4: Ablating Feature Combination Regularization. Top: We train models with only the image ("Image Only"), only the representations ("Reps. Only"), and both concatenated with and without feature combination regularization via dropout (without: $\text{DO} = 0\%$; with: $\text{DO}=25\%$, $50\%$, $75\%$). Regularizing the combination consistently improves validation performance over training on the images or the representations alone, and over simply combining them without regularization. Bottom: We perform a one-channel-removal analysis, measuring the change in validation Dice ($\Delta\mathrm{Dice}$) relative to the full input (image and representations). Without dropout, the model struggles to balance the two sources of information and relies heavily on the image intensity input and less so on the stable inputs, as indicated by the drop in performance. This is alleviated by increasing the dropout probability.
  • Figure 5: Comparing cross-modality representations extracted by foundation models. Rows 1--4: Given a subject from BraTS, we visualize its representations across MRI sequence shifts produced by nnInteractive [isensee2025nninteractive] (rows 1, 2) and anatomix [dey2024learning] (rows 3, 4), finding that anatomix representations are more stable and suitable for domain generalization. Rows 5--8: We produce a similar visualization for two unpaired subjects with cross-modality domain shift from the AMOS dataset and find the same stability trend.
  • ...and 4 more figures
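Since Figure 2 describes DropGen as a few added lines in a standard PyTorch training loop, the sketch below gives one plausible reading of that modification: frozen foundation-model representations are concatenated to the image, and the combination is regularized with channel-wise dropout (the "DO" probability ablated in Figure 4). The function and argument names (`feat_extractor`, `p_drop`) and the exact dropout placement are assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def dropgen_step(model, feat_extractor, optimizer, loss_fn,
                 image, label, p_drop=0.5):
    """One training step, sketched under the assumptions stated above."""
    with torch.no_grad():
        reps = feat_extractor(image)  # frozen, domain-stable representations

    # Concatenate unstable intensities with stable features, then apply
    # channel-wise dropout so the segmentor cannot lean on either source alone.
    x = torch.cat([image, reps], dim=1)
    x = F.dropout3d(x, p=p_drop, training=True)

    loss = loss_fn(model(x), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```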

Theorems & Definitions (9)

  • Remark 1: Shortcut Learning
  • Remark 2
  • Proposition 1: Stationarity forces use of stable inputs
  • Proof
  • Proposition 2: Stable-only performance ceiling
  • Proof
  • Remark 3
  • Proof
  • Proof