Causal Attribution of Model Performance Gaps in Medical Imaging Under Distribution Shifts
Pedro M. Gordaliza, Nataliia Molchanova, Jaume Banus, Thomas Sanchez, Meritxell Bach Cuadra
TL;DR
The paper tackles segmentation performance drops under distribution shifts by extending causal attribution to high-dimensional outputs using Independent Causal Mechanisms and Shapley values. It models the segmentation data-generating process with a causal graph, separating acquisition and annotation shifts, and estimates each mechanism's contribution via importance sampling and density-ratio discriminators integrated with nnU-Net. Empirical results on MSSEG2016 across centers and annotators reveal context-dependent failure modes, with annotation shifts dominating in annotator-change scenarios and acquisition shifts dominating when scanners differ. The findings offer practical guidance for deploying medical image segmentation systems and prioritizing interventions such as annotation standardization or scanner harmonization.
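The importance-sampling step mentioned above can be illustrated with a minimal sketch. It assumes a binary discriminator that outputs, for each source-domain case, the probability that the case comes from the target domain. The ratio of target to source densities is then recovered from that probability using the standard classifier-based identity. The function names, the clipping constant, and the self-normalized estimator are illustrative assumptions, not the authors' nnU-Net integration.

```python
from typing import List, Sequence


def density_ratio_weights(
    p_target: Sequence[float],
    n_source: int,
    n_target: int,
    eps: float = 1e-6,
) -> List[float]:
    """Turn discriminator outputs into importance weights.

    p_target[i] is the (hypothetical) discriminator's probability that
    source sample i belongs to the target domain. By Bayes' rule,
    p_tgt(x) / p_src(x) = [d(x) / (1 - d(x))] * (n_source / n_target).
    """
    weights = []
    for p in p_target:
        # Clip to avoid division by zero for over-confident discriminators.
        p = min(max(p, eps), 1.0 - eps)
        weights.append((p / (1.0 - p)) * (n_source / n_target))
    return weights


def reweighted_mean(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Self-normalized importance-sampling estimate of the mean score
    (for example, per-case Dice) under the target distribution."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

When the discriminator cannot tell the domains apart (all outputs equal 0.5 with balanced sample sizes), every weight is 1 and the estimate reduces to the plain mean. With few samples, as is typical in medical imaging, a handful of large weights can dominate the estimate, so clipping or inspecting the weight distribution is a sensible precaution.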
Abstract
Deep learning models for medical image segmentation suffer significant performance drops under distribution shifts, but the causal mechanisms behind these drops remain poorly understood. We extend causal attribution frameworks to high-dimensional segmentation tasks, quantifying how acquisition protocols and annotation variability independently contribute to performance degradation. We model the data-generating process through a causal graph and employ Shapley values to fairly attribute performance changes to individual mechanisms. Our framework addresses challenges specific to medical imaging: high-dimensional outputs, limited samples, and complex mechanism interactions. Validation on multiple sclerosis (MS) lesion segmentation across 4 centers and 7 annotators reveals context-dependent failure modes: annotation protocol shifts dominate when crossing annotators (7.4% $\pm$ 8.9% Dice similarity coefficient (DSC) attribution), while acquisition shifts dominate when crossing imaging centers (6.5% $\pm$ 9.1% DSC attribution). This mechanism-specific quantification enables practitioners to prioritize targeted interventions based on deployment context.
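To make the Shapley attribution concrete, the following is a minimal sketch for two mechanisms. It assumes a hypothetical function `value(S)` that returns the expected DSC when only the mechanisms in `S` are switched from the source to the target distribution. The names `shapley_attribution` and `TOY_DSC`, and all numbers in the toy table, are made up for illustration and are not results from the paper.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_attribution(
    mechanisms: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average each mechanism's marginal
    contribution over every order in which mechanisms can be shifted."""
    contrib = {m: 0.0 for m in mechanisms}
    n_orders = 0
    for order in permutations(mechanisms):
        shifted: FrozenSet[str] = frozenset()
        for m in order:
            with_m = shifted | {m}
            contrib[m] += value(with_m) - value(shifted)
            shifted = with_m
        n_orders += 1
    return {m: c / n_orders for m, c in contrib.items()}


# Hypothetical expected DSC for each subset of shifted mechanisms.
TOY_DSC = {
    frozenset(): 0.80,                               # no shift (source)
    frozenset({"acquisition"}): 0.74,                # scanner shift only
    frozenset({"annotation"}): 0.76,                 # annotator shift only
    frozenset({"acquisition", "annotation"}): 0.68,  # full target shift
}

attribution = shapley_attribution(
    ["acquisition", "annotation"], lambda s: TOY_DSC[frozenset(s)]
)
```

In this toy table the attributions are about -0.07 for acquisition and -0.05 for annotation, which sum to the total drop of -0.12 between source and target. That additivity (the efficiency property of Shapley values) is what lets a performance gap be split across mechanisms. Exact enumeration is cheap for two mechanisms but grows factorially with their number.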
