Diff-CL: A Novel Cross Pseudo-Supervision Method for Semi-supervised Medical Image Segmentation

Xiuzhen Guo, Lianyuan Yu, Ji Shi, Na Lei, Hongxiao Wang

TL;DR

This work tackles semi-supervised medical image segmentation under limited labels by introducing Diff-CL, a distribution-aware framework that fuses diffusion-based distribution modeling (DS) with detail-oriented CNN segmentation (CS) via cross-pseudo supervision. It adds a 3D high-frequency Mamba module to capture global, high-frequency details efficiently and employs contrastive label propagation to transfer class-semantic information from labeled to unlabeled regions. The method defines dual losses for cross-pseudo supervision, a high-frequency attention mechanism, and a memory-bank–driven contrastive loss, integrating them into a unified semi-supervised objective with $L^{d} = L^{d}_{s} + \mu_1 L^{d}_{p}$ and $L^{c} = L^{c}_{s} + \mu_2 L^{c}_{u}$, where $L^{c}_{u} = L^{c}_{p} + \eta L_{cl}$. Empirically, Diff-CL achieves state-of-the-art performance on left atrium, BraTS brain tumor, and NIH pancreas datasets across low labeling ratios, demonstrating improved generalization and boundary fidelity thanks to the distribution perspective and synergistic model design.
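The way the loss terms above nest can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the stated formulas: the function name, the argument names, and the default weights of 1.0 are assumptions made here, not values taken from the paper.

```python
def diff_cl_objectives(l_d_s, l_d_p, l_c_s, l_c_p, l_cl,
                       mu1=1.0, mu2=1.0, eta=1.0):
    """Combine scalar loss terms following the TL;DR formulas.

    l_d_s, l_c_s : supervised losses of the DS and CS branches
    l_d_p, l_c_p : cross pseudo-supervision losses of the DS and CS branches
    l_cl         : inter-sample contrastive loss
    mu1, mu2, eta: weighting coefficients (defaults are placeholders)
    """
    l_c_u = l_c_p + eta * l_cl   # L^c_u = L^c_p + eta * L_cl
    l_d = l_d_s + mu1 * l_d_p    # L^d   = L^d_s + mu_1 * L^d_p
    l_c = l_c_s + mu2 * l_c_u    # L^c   = L^c_s + mu_2 * L^c_u
    return l_d, l_c


# Hypothetical loss values, chosen only to show the call.
l_d, l_c = diff_cl_objectives(0.5, 0.2, 0.4, 0.1, 0.3,
                              mu1=0.5, mu2=2.0, eta=0.1)
```

Note that the contrastive term enters only the CS objective, scaled by both $\eta$ and $\mu_2$.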

Abstract

Semi-supervised learning utilizes insights from unlabeled data to improve model generalization, thereby reducing reliance on large labeled datasets. Most existing studies focus on limited samples and fail to capture the overall data distribution. We contend that combining distributional information with detailed information is crucial for achieving more robust and accurate segmentation results. On the one hand, diffusion models (DMs), with their robust generative capabilities, learn the data distribution effectively. However, they struggle to capture fine details, which leads to generated images with misleading details. Combining a DM with convolutional neural networks (CNNs) enables the former to learn the data distribution while the latter corrects fine details. Capturing complete high-frequency details with CNNs, however, requires substantial computational resources and is susceptible to local noise. On the other hand, given that labeled and unlabeled data come from the same distribution, we believe that regions in unlabeled data that resemble the overall class semantics of the labeled data are likely to belong to the same class, while regions with minimal similarity are less likely to. This work introduces Diff-CL, a semi-supervised medical image segmentation framework built from the distribution perspective. First, we propose a cross-pseudo-supervision learning mechanism between diffusion and convolution segmentation networks. Second, we design a high-frequency Mamba module to capture boundary and detail information globally. Finally, we apply contrastive learning for label propagation from labeled to unlabeled data. Our method achieves state-of-the-art (SOTA) performance on three datasets: left atrium, brain tumor, and NIH pancreas.
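As a rough illustration of cross pseudo-supervision between two segmentation networks, the sketch below lets each network's hard prediction on unlabeled voxels act as the target for the other. It assumes a cross-entropy loss and argmax pseudo-labels, which is the generic form of the technique; the abstract does not specify the loss used in Diff-CL, and all function names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one voxel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def cross_entropy(logits, targets):
    """Mean cross-entropy; logits is a list of per-voxel class scores."""
    losses = [-math.log(softmax(v)[t]) for v, t in zip(logits, targets)]
    return sum(losses) / len(losses)


def cross_pseudo_supervision(logits_ds, logits_cs):
    """Each branch is supervised by the other branch's hard pseudo-labels.

    In real training the pseudo-labels are treated as constants,
    so no gradient flows through them.
    """
    pseudo_ds = [argmax(v) for v in logits_ds]
    pseudo_cs = [argmax(v) for v in logits_cs]
    loss_ds = cross_entropy(logits_ds, pseudo_cs)  # DS learns from CS
    loss_cs = cross_entropy(logits_cs, pseudo_ds)  # CS learns from DS
    return loss_ds, loss_cs


# Two voxels, two classes; the networks disagree on the second voxel.
loss_ds, loss_cs = cross_pseudo_supervision(
    [[2.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [3.0, 0.0]],
)
```

Voxels on which the two networks disagree dominate both losses, which is what pushes the branches toward consistent predictions on unlabeled data.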

Paper Structure

This paper contains 21 sections, 21 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of Diff-CL. The backbone consists of the CS and DS networks with a projection head. HF-Mamba is our high-frequency Mamba module. $x^l$ and $x^u$ are labeled and unlabeled data, and $y^l$ is the label of the labeled data. $Y^l_0 = y^l$, and $Y^u_0$ is the pseudo-label of $x^u$ from the CS network. After adding $t$ steps of Gaussian noise, we obtain $Y^l_t$ and $Y^u_t$. $(x^l, Y^l_t)$ and $(x^u, Y^u_t)$ are concatenated as inputs to the DS network. $\tilde{f}^l$ and $\hat{f}^u$ are the labeled and unlabeled features from the DS and CS networks, respectively. Inter-CL denotes inter-sample contrastive learning, which pulls similar features together and pushes dissimilar features apart between $\tilde{f}^l$ and $\hat{f}^u$. There are supervised and unsupervised losses. The supervised loss includes $L^c_s$ and $L^d_s$. The unsupervised loss includes the cross pseudo-supervision losses $L^c_p$ and $L^d_p$ and the inter-sample contrastive learning loss $L_{cl}$.
  • Figure 2: $\tilde{f}^l$ and $\hat{f}^u$ are the labeled and unlabeled features output by the DS and CS networks, respectively. $C$ is the number of classes in a dataset. $\textbf{a}$ and $\textbf{b}$ are feature vectors randomly sampled from our memory bank and from $\hat{f}^u$, respectively. By calculating cosine distances between $\textbf{a}$ and $\textbf{b}$, we get the unlabeled positive and negative pairs $\textbf{b}^{\gamma, P}$ and $\textbf{b}^{\gamma, N}$ of $\textbf{a}$, where $\gamma \in \{1, \dots, C\}$. $L_{cl}$ is the inter-sample contrastive loss.
  • Figure 3: High-frequency mamba block.
  • Figure 4: Results of qualitative comparison on LA dataset under 10$\%$ labeled data setting. GT represents the ground truth.
  • Figure 5: Results of qualitative comparison on BraTS dataset. GT represents the ground truth.
  • ...and 1 more figures
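
The pair selection described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the memory bank is modeled as a dictionary from class index to labeled feature vectors, similarity to a class is the mean cosine similarity to that class's vectors, and the thresholds `pos_thresh` and `neg_thresh` are hypothetical. The caption does not give the actual selection rule or the form of $L_{cl}$.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def select_pairs(bank, unlabeled, pos_thresh=0.8, neg_thresh=0.2):
    """Split unlabeled features into positives and negatives per class.

    bank      : dict mapping class index -> list of labeled vectors (a)
    unlabeled : list of unlabeled feature vectors (b)
    Returns a dict mapping class index -> (positive_ids, negative_ids),
    where the ids index into `unlabeled`.
    """
    pairs = {}
    for cls, anchors in bank.items():
        positives, negatives = [], []
        for j, b in enumerate(unlabeled):
            # Mean similarity of b to the stored vectors of this class.
            sim = sum(cosine_similarity(a, b) for a in anchors) / len(anchors)
            if sim >= pos_thresh:
                positives.append(j)
            elif sim <= neg_thresh:
                negatives.append(j)
            # Features between the thresholds are left unassigned.
        pairs[cls] = (positives, negatives)
    return pairs


# Toy 2-D features for two classes (values are illustrative only).
bank = {1: [[1.0, 0.0]], 2: [[0.0, 1.0]]}
unlabeled = [[2.0, 0.1], [0.0, 3.0], [1.0, 1.0]]
pairs = select_pairs(bank, unlabeled)
```

In this toy case the third unlabeled vector lies between both thresholds for both classes, so it is used neither as a positive nor as a negative.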