Is Two-shot All You Need? A Label-efficient Approach for Video Segmentation in Breast Ultrasound

Jiajun Zeng; Dong Ni; Ruobing Huang

TL;DR

This paper tackles the challenge of breast ultrasound video lesion segmentation under limited annotations. It introduces a two-shot VOS framework, ST-BV, built on a teacher–student self-training pipeline with quadro-inference and a source-dependent augmentation to mitigate noisy pseudo-labels, formalized by a two-shot objective $f^s = \arg\min_{f} \frac{1}{2N}\sum_{i=1}^{N} \sum_{t=1}^{2} \mathcal{L}_{seg}(\hat{Y}^l, Y^l)$. To further enforce temporal coherence without extra labels, it adds explicit space-time supervision (STCS) via a contrastive spatio-temporal loss that aligns features across frames. On an in-house BUS dataset, the method achieves performance comparable to fully supervised approaches even with only 1.9% labeled frames and generalizes across backbones like XMem, highlighting practical impact for reducing annotation burden in medical video analysis.
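The two-shot objective above averages a segmentation loss over the two labeled frames of each of the N training videos. The following is a minimal sketch of that averaging only, not the authors' implementation. It assumes masks are flattened lists of per-pixel values, and it uses binary cross-entropy as a stand-in for $\mathcal{L}_{seg}$, whose exact form is not given here. The names `bce_loss` and `two_shot_objective` are hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a flattened mask.

    Assumption: used here only as a stand-in for L_seg.
    """
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred)


def two_shot_objective(videos, seg_loss=bce_loss):
    """Value of (1 / 2N) * sum_i sum_t L_seg(pred, label).

    videos: list of N entries, one per video; each entry holds exactly two
    (prediction, label) pairs, one per labeled frame (hypothetical layout).
    """
    n = len(videos)
    total = 0.0
    for labeled_frames in videos:
        if len(labeled_frames) != 2:
            raise ValueError("expected exactly two labeled frames per video")
        for pred, label in labeled_frames:
            total += seg_loss(pred, label)
    return total / (2 * n)


# Toy usage: 3 videos, 2 labeled frames each, 4-pixel masks.
label = [0, 1, 1, 0]
uncertain_pred = [0.5, 0.5, 0.5, 0.5]
videos = [[(uncertain_pred, label), (uncertain_pred, label)] for _ in range(3)]
objective_value = two_shot_objective(videos)
```

In training, this quantity would be minimized over the model parameters; the sketch only evaluates it for fixed predictions.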

Abstract

Breast lesion segmentation from breast ultrasound (BUS) videos could assist in early diagnosis and treatment. Existing video object segmentation (VOS) methods usually require dense annotation, which is often unavailable for medical datasets. Furthermore, they suffer from accumulated errors and a lack of explicit space-time awareness. In this work, we propose a novel two-shot training paradigm for BUS video segmentation. It not only captures free-range space-time consistency but also uses a source-dependent augmentation scheme. This label-efficient learning framework is validated on a challenging in-house BUS video dataset. Results show that it achieves performance comparable to fully annotated counterparts with only 1.9% of the training labels.

Paper Structure (9 sections, 8 equations, 4 figures, 2 tables)

Introduction
Related works
Method
Two-shot BUS VOS
Explicit space-time supervision
Experiments
Results and Discussion
Conclusion
Acknowledgments

Figures (4)

Figure 1: The difference between common and two-shot SVOS.
Figure 2: The overall architecture of ST-BV.
Figure 3: The proposed efficient two-shot training paradigm.
Figure 4: Qualitative results of different VOS methods.
