Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

Ayush Roy, Samin Enam, Jun Xia, Won Hwa Kim, Vishnu Suresh Lokhande

TL;DR

This work investigates medical image segmentation under data scarcity, drawing insights from causal frameworks to propose a method for controlling foreground-background feature discrepancies across all layers of deep networks, which achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets.

Abstract

Data scarcity is a major challenge in medical imaging, particularly for deep learning models. While data pooling (combining datasets from multiple sources) and data addition (adding more data from a new dataset) have been shown to enhance model performance, they are not without complications. Specifically, increasing the size of the training dataset through pooling or addition can induce distributional shifts, negatively affecting downstream model performance, a phenomenon known as the "Data Addition Dilemma". While the traditional i.i.d. assumption may not hold in multi-source contexts, assuming exchangeability across datasets provides a more practical framework for data pooling. In this work, we investigate medical image segmentation under these conditions, drawing insights from causal frameworks to propose a method for controlling foreground-background feature discrepancies across all layers of deep networks. This approach improves feature representations, which are crucial in data-addition scenarios. Our method achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets, including a novel ultrasound dataset that we have curated and contributed. Qualitative results demonstrate more refined and accurate segmentation maps compared to prominent baselines across three model architectures.
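The abstract does not give the form of the foreground-background feature discrepancy term (written $\mathcal{L}_{\textbf{fd}}$ in the figure captions), so the following is only a minimal sketch under stated assumptions. It takes the discrepancy at one layer to be the cosine similarity between the mean foreground feature and the mean background feature, and it combines layers as a weighted sum using one weight $\alpha$ per layer. Both choices, and the names `fd_loss` and `total_fd_loss`, are hypothetical illustrations rather than the authors' definitions.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fd_loss(features, mask, eps=1e-8):
    """Hypothetical stand-in for a foreground-background discrepancy loss.

    features: list of per-pixel feature vectors from one layer.
    mask:     list of 0/1 labels (1 = foreground), same length as features.
    Returns the cosine similarity between the mean foreground and mean
    background feature (ASSUMED form, not taken from the paper).
    """
    fg = [f for f, m in zip(features, mask) if m == 1]
    bg = [f for f, m in zip(features, mask) if m == 0]
    if not fg or not bg:
        return 0.0  # no contrast to measure when one region is empty
    mu_fg, mu_bg = mean_vector(fg), mean_vector(bg)
    dot = sum(a * b for a, b in zip(mu_fg, mu_bg))
    norm = math.sqrt(sum(a * a for a in mu_fg)) * math.sqrt(sum(b * b for b in mu_bg))
    return dot / (norm + eps)


def total_fd_loss(per_layer_losses, alphas):
    """Weighted sum over layers with one weight per layer (ASSUMED combination)."""
    return sum(a * l for a, l in zip(alphas, per_layer_losses))


# Well-separated foreground/background features give a value near 0,
# while identical features give a value near 1.
separated = fd_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], [1, 1, 0, 0])
overlapping = fd_loss([[1.0, 1.0], [1.0, 1.0]], [1, 0])
```

In a real training loop the per-layer weights would be learnable parameters of the network and the features would be tensors; plain lists are used here only to keep the sketch dependency-free.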

Paper Structure

This paper contains 28 sections, 22 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) Strong correlation between Dice and $\mathcal{L}_{\textbf{fd}}$ (foreground-background feature discrepancy loss). Strong correlation in both NucleiSegNet decoder and CMUNet encoder layers for ultrasound and histopathology images. (b) Impact of Data Distribution Shift on Model Performance. Adding S2 (similar distribution) to S1 training improves S1 test Dice, as expected with more data. However, adding S3 (distribution shift) degrades performance, consistent with shen2024data. (c) Proposed $\mathcal{L}_{\textbf{fd}}$ applied to all U-Net layers. Encoder (green), Decoder (grey), and Bottleneck (orange) features represent mediator $Z$, optimized by $\mathcal{L}_{\textbf{fd}}$. Each layer uses $\mathcal{L}_{\textbf{fd}}$ with a unique learnable parameter $\alpha$.
  • Figure 2: (a) $\alpha$ (layer-wise weights) vs. $\mathcal{L}_\textbf{fd}$ (feature discrepancy loss) for NucleiSegNet layers on TNBC; a similar trend is observed across all models and datasets. (b) Right and left shifts in the test-sample distributions of Dice scores and $\mathcal{L}_\textbf{fd}$, respectively, after applying $\mathcal{L}_\textbf{fd}$ (orange curve) for CMUNet on UDIAT, with a similar trend across datasets. Refined activation maps justify this improvement in Dice scores (see Figure \ref{fig:Heatmap}, Dec 4 and Bot) after penalizing foreground-background discrepancy with $\mathcal{L}_\textbf{fd}$. (c) Causal graph linking input $X$, mediator $Z$, label $Y$, and unobserved confounders $U$.
  • Figure 3: Data Addition Dilemma. An ablation showing the performance of various loss functions for histopathology and ultrasound datasets under the "Data Addition Dilemma" shen2024data. Data from $D_{novel}$ is added to $D_{base}$ to observe how losses handle distribution shifts. $\mathcal{L}^\textbf{exch}_\textbf{fd}$ (orange) outperforms others in mitigating distribution shift when pooling data from multiple sources. The US-TNBC dataset has fewer samples, so we only added UDIAT dataset samples until their number matched (see Table \ref{dataset_summary}).
  • Figure 4: Change in Dice scores with change in $\mathcal{L}_\textbf{fd}$. Dice score is plotted against $\mathcal{L}_\textbf{fd}$ for samples of TNBC naylor2018segmentation and US-TNBC at the Bottleneck (Bot) layer of NucleiSegNet lal2021nucleisegnet and CMUNet 10230609, respectively. The green arrows indicate the movement of each point after applying $\mathcal{L}_\textbf{fd}$. The red arrow indicates the overall movement of the majority of the samples.
  • Figure 5: The steps involved in the creation of the US-TNBC dataset.
  • ...and 4 more figures
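
The captions above report Dice scores and a correlation between Dice and $\mathcal{L}_\textbf{fd}$. As a minimal sketch, the standard Dice coefficient for binary masks, $2|A \cap B| / (|A| + |B|)$, can be computed as below. The captions do not say which correlation measure is used, so the Pearson coefficient here is an assumption, and the helper names `dice_score` and `pearson` are illustrative only.

```python
import math


def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks match perfectly
    return 2.0 * intersection / total


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences (ASSUMED measure)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# One of two predicted foreground pixels overlaps the single true pixel:
# Dice = 2 * 1 / (2 + 1).
partial = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

With per-sample Dice scores and per-sample loss values collected in two lists, `pearson(dice_values, loss_values)` would give one number summarizing how strongly the two move together.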

Theorems & Definitions (2)

  • proof
  • proof