Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video

Hao Li, Daiwei Lu, Xing Yao, Nicholas Kavoussi, Ipek Oguz

TL;DR

This work tackles robust endoscopic image segmentation under limited annotations. It introduces Endo-SemiS, a semi-supervised framework that combines cross-supervision of two U-Nets with uncertainty-guided pseudo-labeling, joint pseudo-label supervision, and multi-level mutual learning, plus a spatiotemporal correction module to exploit video context. The method dynamically filters unreliable regions using aleatoric and epistemic uncertainty, and fuses predictions to produce reliable supervision for unlabeled frames. Experiments on kidney stone lithotripsy and colon polyp screening show Endo-SemiS achieves state-of-the-art performance, often surpassing fully supervised baselines with much less labeled data and demonstrating cross-site generalization. The work offers a practical, real-time-capable approach for robust endoscopic segmentation in resource-constrained settings.

Abstract

In this paper, we present Endo-SemiS, a semi-supervised segmentation framework for providing reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance by effectively utilizing all available data, particularly unlabeled data: (1) Cross-supervision between two individual networks that supervise each other; (2) Uncertainty-guided pseudo-labels from unlabeled data, which are generated by selecting high-confidence regions to improve their quality; (3) Joint pseudo-label supervision, which aggregates reliable pixels from the pseudo-labels of both networks to provide accurate supervision for unlabeled data; and (4) Mutual learning, where both networks learn from each other at the feature and image levels, reducing variance and guiding them toward a consistent solution. Additionally, a separate corrective network utilizes spatiotemporal information from endoscopy video to improve segmentation performance. Endo-SemiS is evaluated on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy. Compared to state-of-the-art segmentation methods, Endo-SemiS achieves substantially superior results on both datasets with limited labeled data. The code is publicly available at https://github.com/MedICL-VU/Endo-SemiS

Paper Structure

This paper contains 25 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Challenging ureteroscopy (a–f, left) and colonoscopy (g–h, right) images for segmentation. (a) irrigation; (b) bleeding; (c) motion blur; (d) early ablation; (e) mid ablation; (f) late ablation. The arrow indicates the target kidney stone for ablation. (g) and (h) are from the public dataset ali2023multi, which is collected from multiple imaging sites.
  • Figure 2: The proposed framework adapts the widely used cross-supervision baseline (a) with uncertainty-guided supervision to obtain reliable pseudo-labels (b–c), and further incorporates multi-level mutual learning (d) to improve cross-network consistency. Panels (b–c) (in blue) operate only on unlabeled data $x_u$, whereas (d) is applied only to labeled data $x_l$. The two networks share the same architecture but are optimized independently. $y$, $\tilde{y}$, and $\tilde{y}^{uc}$ denote the ground-truth mask, the raw pseudo-label, and the uncertainty-guided pseudo-label, respectively. $\odot$ denotes the Hadamard (element-wise) product, and $U^b$ is the binary mask derived from the uncertainty map $U$. $x_u^s$ represents a strongly intensity-augmented version of $x_u$. We define $\tilde{y}_1^{uc}=\tilde{y}_1\odot U_1^{b}$ and $\tilde{y}_2^{uc}=\tilde{y}_2\odot U_2^{b}$; these products are omitted from the diagram for brevity.
  • Figure 3: (a) For an unlabeled image $x_u$, uncertainty-guided pseudo-labels $\tilde{y}_1^{uc}$ and $\tilde{y}_2^{uc}$ (green boxes) are obtained by dynamically filtering the raw pseudo-labels $\tilde{y}_1$ and $\tilde{y}_2$, leading to cleaner supervision. The label $y_u$ of the unlabeled image is shown for reference only. (b) $M$ chooses the lower-uncertainty prediction at each pixel to obtain the joint pseudo-label $\tilde{y}_j^{uc}$ for more reliable supervision by correcting residual defects in $\tilde{y}_2^{uc}$ from (a). (c) Compared with the pseudo-labels at epoch $n$ in (a), the $\tilde{y}_1^{uc}$, $\tilde{y}_2^{uc}$ and $\tilde{y}_j^{uc}$ at epochs $n+1$ and $n+2$ become cleaner and more consistent with $y_u$, indicating the effectiveness of (a) and (b).
  • Figure 4: Qualitative kidney stone results (10% labeled data). Yellow circles highlight poor visibility areas. (a) fiberoptic frames, (b) digital frames, (c) fluid distortions, (d) motion blur, (e) debris during stone ablation, and (f) illumination changes.
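
The masking and fusion steps described in the captions of Figures 2 and 3 can be illustrated with a small NumPy sketch. It is not the authors' implementation. The caption definitions $\tilde{y}^{uc}=\tilde{y}\odot U^{b}$ and the per-pixel choice of the lower-uncertainty prediction are taken from the text. Everything else is an assumption made for illustration: binary-entropy as the uncertainty map $U$ (the paper combines aleatoric and epistemic uncertainty), a fixed threshold `tau` (the paper filters dynamically), and all function and variable names.

```python
import numpy as np


def binary_entropy(prob, eps=1e-8):
    """Per-pixel entropy of a foreground probability map.

    Assumption: used here as a simple stand-in for the uncertainty map U.
    """
    p = np.clip(prob, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def uncertainty_guided_label(prob, tau):
    """Return (y_uc, U) with y_uc = y_tilde * U_b (Hadamard product).

    y_tilde is the raw pseudo-label; U_b is 1 where uncertainty is below tau.
    Assumption: tau is a fixed, hypothetical threshold.
    """
    y_tilde = (prob > 0.5).astype(np.uint8)  # raw pseudo-label
    U = binary_entropy(prob)                 # stand-in uncertainty map
    U_b = (U < tau).astype(np.uint8)         # binary reliability mask
    return y_tilde * U_b, U


def joint_pseudo_label(prob1, prob2, tau):
    """Fuse two filtered pseudo-labels by picking the lower-uncertainty pixel."""
    y1_uc, U1 = uncertainty_guided_label(prob1, tau)
    y2_uc, U2 = uncertainty_guided_label(prob2, tau)
    M = U1 <= U2                             # True where network 1 is more certain
    return np.where(M, y1_uc, y2_uc), y1_uc, y2_uc


# Toy 2x2 foreground probabilities from two hypothetical networks.
prob1 = np.array([[0.95, 0.55], [0.10, 0.90]])
prob2 = np.array([[0.60, 0.98], [0.05, 0.52]])

y_joint, y1_uc, y2_uc = joint_pseudo_label(prob1, prob2, tau=0.5)
print(y1_uc.tolist())    # [[1, 0], [0, 1]]
print(y2_uc.tolist())    # [[0, 1], [0, 0]]
print(y_joint.tolist())  # [[1, 1], [0, 1]]
```

In this toy case each filtered pseudo-label drops a foreground pixel where its own network is uncertain, and the joint label recovers both pixels by taking each one from the more confident network, which mirrors the behavior described for Figure 3(b).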