Revisiting Invariant Learning for Out-of-Domain Generalization on Multi-Site Mammogram Datasets

Hung Q. Vo, Samira Zare, Son T. Ly, Lin Wang, Chika F. Ezeana, Xiaohui Yu, Kelvin K. Wong, Stephen T. C. Wong, Hien V. Nguyen

TL;DR

The paper tackles domain generalization in mammography to promote health equity. It benchmarks invariant learning methods IRM and VREx against a rigorously optimized ERM baseline using multi-site data aggregated from the USA, Portugal, and Cyprus, with OOD testing on Egypt and Sweden. Results show that standard ERM with diverse data consistently outperforms invariant approaches, while IRM struggles with optimization and VREx offers no clear generalization gains. The authors conclude that data diversity across nations currently provides the most reliable path to robust, equitable breast cancer screening.

Abstract

Achieving health equity in Artificial Intelligence (AI) requires diagnostic models that maintain reliability across diverse populations. However, breast cancer screening systems frequently suffer from domain overfitting, degrading significantly when deployed to varying demographics. While Invariant Learning algorithms aim to mitigate this by suppressing site-specific correlations, their efficacy in medical imaging remains underexplored. This study comprehensively evaluates domain generalization techniques for mammography. We constructed a multi-source training environment aggregating datasets from the United States (CBIS-DDSM, EMBED), Portugal (INbreast, BCDR), and Cyprus (BMCD). To assess global generalizability, we evaluated performance on unseen cohorts from Egypt (CDD-CESM) and Sweden (CSAW-CC). We benchmarked Invariant Risk Minimization (IRM) and Variance Risk Extrapolation (VREx) against a rigorously optimized Empirical Risk Minimization (ERM) baseline. Contrary to expectations, standard ERM consistently outperformed specialized invariant mechanisms on out-of-domain testing. While VREx showed potential in stabilizing attention maps, invariant objectives proved unstable and prone to underfitting. We conclude that engineering equitable AI is currently best served by maximizing multi-national data diversity rather than relying on complex algorithmic invariance.
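For readers unfamiliar with the three objectives named in the abstract, the sketch below shows how they differ when each training site is treated as one environment. It is a minimal illustration under stated assumptions, not the authors' implementation: it uses the commonly cited IRMv1 penalty (squared gradient of the risk with respect to a dummy scale on the logits) and the VREx variance-of-risks penalty, averages per-environment risks for ERM, and assumes binary cross-entropy. The function names, the weights `lam` and `beta`, and the toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_risk(logits, labels):
    """Mean binary cross-entropy for one environment (one site)."""
    # log(1 + exp(z)) - y * z, computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, logits) - labels * logits))

def irm_penalty(logits, labels):
    """IRMv1-style penalty: squared gradient of the risk with respect to
    a scalar multiplier w on the logits, evaluated at w = 1.
    For BCE this gradient is mean((sigmoid(z) - y) * z)."""
    grad = np.mean((sigmoid(logits) - labels) * logits)
    return float(grad ** 2)

def erm_objective(envs):
    """ERM here: plain average of per-environment risks (an assumption;
    pooling all samples is equivalent only for equal-sized environments)."""
    return float(np.mean([bce_risk(z, y) for z, y in envs]))

def irm_objective(envs, lam=1.0):
    """Average risk plus lam times the average IRMv1 penalty."""
    penalty = np.mean([irm_penalty(z, y) for z, y in envs])
    return erm_objective(envs) + lam * float(penalty)

def vrex_objective(envs, beta=1.0):
    """Average risk plus beta times the variance of per-environment risks."""
    risks = np.array([bce_risk(z, y) for z, y in envs])
    return float(risks.mean() + beta * risks.var())

# Toy data: two hypothetical sites where the same logits fit one site
# well and the other badly, so the per-site risks disagree.
site_a = (np.array([2.0, -2.0]), np.array([1.0, 0.0]))
site_b = (np.array([2.0, -2.0]), np.array([0.0, 1.0]))
envs = [site_a, site_b]

print("ERM :", erm_objective(envs))
print("IRM :", irm_objective(envs, lam=1.0))
print("VREx:", vrex_objective(envs, beta=1.0))
```

On this toy input both invariant objectives exceed the ERM value, because they add a non-negative term that grows when sites disagree. The abstract's observation that invariant objectives were unstable and prone to underfitting concerns how such penalties behave during optimization, which this static calculation does not capture.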

Paper Structure

This paper contains 13 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Qualitative comparison of ResNet34d feature visualizations. Rows correspond to different datasets, while columns represent the training methods. Each entry displays a triplet: the original mammogram (left), the activation map for the Malignant class (center), and the activation map for the Benign class (right).
  • Figure 2: Qualitative comparison of ConvNeXt-tiny feature visualizations. Rows correspond to different datasets, while columns represent the training methods. Each entry displays a triplet: the original mammogram (left), the activation map for the Malignant class (center), and the activation map for the Benign class (right).