IBoxCLA: Towards Robust Box-supervised Segmentation of Polyp via Improved Box-dice and Contrastive Latent-anchors

Zhiwei Wang, Qiang Hu, Hongkuan Shi, Li He, Man He, Wenxuan Dai, Yinjiao Tian, Xin Yang, Mei Liu, Qiang Li

TL;DR

This work addresses the annotation cost of pixel-level polyp segmentation by proposing a box-supervised framework, IBoxCLA. IBox decouples the learning of location/size from the learning of shape through shape decoupling and confusion-region swapping, while CLA uses an EMA teacher and latent anchors to sharpen shape discrimination through contrastive learning. The method is competitive with fully supervised methods and improves substantially on prior box-supervised approaches across five public polyp datasets, with further gains under mixed supervision that adds extra box-annotated data. Because it needs only cheap box annotations, the approach offers a practical route to scalable polyp segmentation.

Abstract

Box-supervised polyp segmentation is attracting increasing attention for its cost-effectiveness. Existing solutions often rely on learning-free methods or pretrained models to laboriously generate pseudo masks, which are then used to apply a Dice constraint. In this paper, we find that a model guided by the simplest box-filled masks can accurately predict polyp locations/sizes, but suffers from shape collapse. In response, we propose two learning schemes, Improved Box-dice (IBox) and Contrastive Latent-Anchors (CLA), and combine them to train a robust box-supervised model, IBoxCLA. The core idea behind IBoxCLA is to decouple the learning of location/size from the learning of shape, allowing focused constraints on each. Specifically, IBox transforms the segmentation map into a proxy map by applying shape decoupling and then confusion-region swapping. Within the proxy map, shapes are disentangled, while locations/sizes are encoded as box-like responses. By constraining the proxy map instead of the raw prediction, the box-filled mask can supervise IBoxCLA without misleading its shape learning. Furthermore, CLA contributes to shape learning by generating two types of latent anchors, which are learned and updated using momentum and segmented polyps so that they stably represent polyp and background features. The latent anchors help IBoxCLA capture discriminative features inside and outside the boxes in a contrastive manner, yielding clearer boundaries. We benchmark IBoxCLA on five public polyp datasets. The experimental results show that IBoxCLA is competitive with recent fully supervised polyp segmentation methods and outperforms other state-of-the-art box-supervised methods, with relative increases in overall mDice and mIoU of at least 6.5% and 7.5%, respectively.
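To make the shape-collapse observation concrete, here is a minimal sketch in plain Python of a soft Dice loss computed directly against a box-filled mask. The helper names (`box_filled_mask`, `dice_loss`, `inside_plus`), the box convention, and the toy 6x6 shapes are assumptions chosen for illustration; they are not taken from the paper.

```python
def box_filled_mask(height, width, box):
    """Binary mask (list of rows) that is 1.0 inside the box, 0.0 elsewhere.

    `box` is (x0, y0, x1, y1) with exclusive x1/y1 (an assumed convention).
    """
    x0, y0, x1, y1 = box
    return [[1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
             for x in range(width)] for y in range(height)]


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2 * sum(p * g) / (sum(p) + sum(g))."""
    inter = sum(p * g for pr, gr in zip(pred, target) for p, g in zip(pr, gr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def inside_plus(x, y):
    """Hypothetical plus-shaped 'polyp' that touches all four sides of the box."""
    in_box = 1 <= x < 5 and 1 <= y < 5
    return in_box and (x in (2, 3) or y in (2, 3))


# The box covers rows/columns 1..4 of a 6x6 image.
target = box_filled_mask(6, 6, (1, 1, 5, 5))
true_shape = [[1.0 if inside_plus(x, y) else 0.0 for x in range(6)]
              for y in range(6)]

loss_box_shaped = dice_loss(target, target)      # prediction fills the whole box
loss_true_shape = dice_loss(true_shape, target)  # prediction follows the outline
```

In this toy case the box-shaped prediction gets a loss of 0, while the prediction that follows the plus-shaped outline is penalized (about 0.143), even though it is the correct shape. This is the kind of pressure toward box-shaped outputs that the abstract describes.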

Paper Structure

This paper contains 29 sections, 7 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Comparison between (a) a simplistic method that directly minimizes the Dice loss between raw segmentation maps and box-filled masks and (b) a prototype of our method, which applies the Dice loss to proxy maps of the predictions, in which the shapes are disentangled. Green: true positives; Yellow: false negatives; Red: false positives. As the inference results show, (a) the simplistic model is supervised by the incorrect shape information of the box-filled mask and therefore produces box-shaped over-segmentation. In comparison, (b) the prototype of our method avoids this misleading supervision by using the shape-decoupled proxy maps and can therefore freely learn to segment boundaries.
  • Figure 2: Illustration of the training phase of IBoxCLA. The segmentation model produces pixel-level predictions, and the teacher model is its EMA copy, which provides latent anchors and masks. The segmentation model is constrained in three ways: (1) the Improved Box-dice (IBox) constraint, (2) the Contrastive Latent-Anchor (CLA) constraint, and (3) a teacher-provided mask constraint (omitted in this figure).
  • Figure 3: Details of IBox, consisting of shape decoupling and confusion-region swapping.
  • Figure 4: Details of CLA, consisting of latent anchor generation and anchor-based contrastive constraint.
  • Figure 5: Three visualization results of different box-supervised methods. Green: true positives; Yellow: false negatives; Red: false positives.
  • ...and 4 more figures
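The Figure 2 caption describes the teacher model as an EMA copy of the segmentation model. As a rough sketch of what such an update generally looks like, the snippet below applies the standard exponential-moving-average rule to a toy parameter dictionary. The function name `ema_update`, the dictionary layout, and the momentum values are assumptions for illustration; the summary does not state the paper's momentum or update schedule.

```python
def ema_update(teacher, student, momentum=0.99):
    """In-place EMA: teacher <- momentum * teacher + (1 - momentum) * student.

    `teacher` and `student` map parameter names to lists of floats
    (a stand-in for real model weights; assumed layout).
    """
    for name, student_vals in student.items():
        teacher_vals = teacher[name]
        for i, s in enumerate(student_vals):
            teacher_vals[i] = momentum * teacher_vals[i] + (1.0 - momentum) * s


# Toy example: the teacher drifts toward a fixed student over three steps.
student = {"w": [1.0, 2.0]}
teacher = {"w": [0.0, 0.0]}
for _ in range(3):
    ema_update(teacher, student, momentum=0.5)
# teacher["w"] is now [0.875, 1.75]
```

A momentum close to 1 makes the teacher change slowly, which is the usual reason for using an EMA copy as a stable source of targets. The abstract says the latent anchors are also updated using momentum; whether they follow exactly this rule is not specified here.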