Boosting Unsupervised Segmentation Learning

Alp Eren Sari, Francesco Locatello, Paolo Favaro

TL;DR

This work tackles the limited resolution of unsupervised segmentation masks produced by state-of-the-art methods that rely on downsampled features. It introduces two practical techniques: guided filtering, which uses the luminance channel as guidance to refine segmentation masks with negligible compute overhead, and a multi-scale consistency criterion, implemented via a teacher-student framework with a cropping-based equivariance loss $L_{eq}$ and a stop-gradient mechanism to prevent mask collapse. The methods deliver state-of-the-art results on the unsupervised saliency benchmarks DUT-OMRON, DUTS-TE, and ECSSD, and improve CorLoc for unsupervised single-object detection on VOC and COCO20K, while remaining backbone-agnostic. The approach is modular and easy to apply across diverse unsupervised segmentation methods; code is to be released, and extensive ablations identify guided filtering as a key driver of the gains.

Abstract

We present two practical improvement techniques for unsupervised segmentation learning. These techniques address limitations in the resolution and accuracy of predicted segmentation maps of recent state-of-the-art methods. Firstly, we leverage image post-processing techniques such as guided filtering to refine the output masks, improving accuracy while avoiding substantial computational costs. Secondly, we introduce a multi-scale consistency criterion, based on a teacher-student training scheme. This criterion matches segmentation masks predicted from regions of the input image extracted at different resolutions to each other. Experimental results on several benchmarks used in unsupervised segmentation learning demonstrate the effectiveness of our proposed techniques.
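To make the guided-filtering step concrete, the following is a minimal NumPy sketch of the standard guided filter applied to a soft mask, with the image luminance as guidance. It is an illustration under stated assumptions, not the authors' implementation: the Rec. 601 luma weights, the `radius` and `eps` defaults, and the helper names `luminance`, `box_mean`, and `guided_filter` are all hypothetical choices, and the mask is assumed to be already upsampled to the image resolution.

```python
import numpy as np

def luminance(rgb):
    """Luma of an (H, W, 3) float image in [0, 1]; Rec. 601 weights are an assumption."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, truncated at the image border."""
    h, w = x.shape
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = x.cumsum(axis=0).cumsum(axis=1)  # integral image
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    total = ii[y1][:, x1] - ii[y0][:, x1] - ii[y1][:, x0] + ii[y0][:, x0]
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return total / area

def guided_filter(guide, mask, radius=4, eps=1e-3):
    """Refine `mask` so that its edges follow the edges of `guide` (both (H, W))."""
    mean_g = box_mean(guide, radius)
    mean_m = box_mean(mask, radius)
    var_g = box_mean(guide * guide, radius) - mean_g ** 2
    cov_gm = box_mean(guide * mask, radius) - mean_g * mean_m
    # Per-window linear model: mask ~ a * guide + b
    a = cov_gm / (var_g + eps)
    b = mean_m - a * mean_g
    # Average the coefficients of all windows covering each pixel
    return box_mean(a, radius) * guide + box_mean(b, radius)
```

A call such as `guided_filter(luminance(image), coarse_mask)` returns a refined mask of the same shape. Because the box means are computed from an integral image, the cost is independent of the window radius, which is consistent with the low overhead the abstract describes.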

Paper Structure (13 sections, 6 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of the proposed multi-scale consistency procedure. First, we predict a segmentation from an input image. Then, we randomly crop a portion of the input image, predict a more detailed segmentation mask, refine this prediction with guided filtering, and apply a stop-gradient operation to the final mask to prevent mask prediction collapse. Finally, we calculate the mean squared error between the corresponding region of the initial mask and the detailed target mask.
  • Figure 2: A qualitative comparison between the baseline and the improved segmentation with our proposed tricks (Ours). The first row shows images sampled from DUTS-TE [wang2017learning], the second row shows the baseline segmentation predictions, the third row shows the segmentation results with our tricks, and the last row shows the ground truth segmentation masks. In (a) we show four image samples where we achieve a significant improvement over the baseline; in (b) we show two image samples where we achieve the same results as the baseline; in (c) we show two image samples where the tricks make the segmentation masks worse (relative to the ground truth mask). We point out that some of the incorrect masks may also be due to the inherent ambiguity of the unsupervised segmentation task.
  • Figure 3: Example cases from Pascal VOC12 [everingham2012pascal] where our tricks improve object detection over the baseline.
  • Figure 4: Some of the failure cases of our method from DUTS-TE [wang2017learning]. In most cases the saliency is ambiguous or the background and the foreground are almost indistinguishable.
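The procedure in the Figure 1 caption can be sketched as a single loss computation. This is a hedged NumPy illustration, not the paper's code. The names `resize_nearest`, `random_box`, and `multiscale_consistency_loss` are hypothetical, as is the choice to resize both the crop and the matching mask region back to the input resolution with nearest-neighbour sampling. `predict_mask` stands in for the segmentation model, used here for both the full image and the crop, and `refine` stands in for the guided-filtering step. NumPy has no automatic differentiation, so the stop-gradient is represented only by treating the target as a constant; in an autodiff framework the target would be detached explicitly (for example with `.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX).

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    rows = np.arange(out_h) * x.shape[0] // out_h
    cols = np.arange(out_w) * x.shape[1] // out_w
    return x[rows][:, cols]

def random_box(h, w, frac, rng):
    """Random crop box (top, left, height, width) covering `frac` of each side."""
    ch, cw = int(h * frac), int(w * frac)
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return top, left, ch, cw

def multiscale_consistency_loss(image, predict_mask, box,
                                refine=lambda guide, mask: mask):
    """MSE between the cropped region of the full-image mask and the crop's mask."""
    h, w = image.shape[:2]
    top, left, ch, cw = box
    # Prediction on the full image (the branch that would receive gradients)
    coarse = predict_mask(image)
    # Prediction on the enlarged crop gives a more detailed target mask
    crop = resize_nearest(image[top:top + ch, left:left + cw], h, w)
    target = refine(crop, predict_mask(crop))
    # Stop-gradient: `target` is treated as a constant from here on
    region = resize_nearest(coarse[top:top + ch, left:left + cw], h, w)
    return float(np.mean((region - target) ** 2))
```

A predictor that commutes with cropping and resizing, such as a per-pixel threshold, gives a loss of zero, while a predictor that ignores image content is penalised. Without the stop-gradient, a trivially constant mask would also minimise this loss, which is the collapse the caption refers to.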