Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation

Xingye Fan, Zhongwen Zhang, Yuri Boykov

TL;DR

This work introduces approximate size targets as a practical image-level supervision signal for semantic segmentation. By computing the image-wide average prediction $\bar{S}$ and optimizing a forward KL loss $L_{size}=KL(v\|\bar{S})$ in combination with a CRF regularizer, the method achieves segmentation accuracy competitive with full pixel-level supervision on datasets like PASCAL VOC, and remains robust to substantial errors in size targets. The approach integrates with scribble/seeds and extends to human-annotated and medical datasets, outperforming size-barrier baselines and many single-stage methods while matching or surpassing more complex multi-stage approaches. The findings suggest that simple, approximate size information can effectively guide segmentation, reducing annotation burden and enabling robust performance across domains.
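The size loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `size_loss`, the `(H, W, K)` array layout, and the stabilizing constant `eps` are assumptions made for the example, and the CRF regularizer is left out.

```python
import numpy as np

def size_loss(probs, v, eps=1e-8):
    """Forward KL(v || S_bar) between a size target v and the average prediction.

    probs: array of shape (H, W, K) with per-pixel class probabilities.
    v:     array of shape (K,) with target relative class sizes (sums to 1).
    eps:   small constant for numerical stability (an assumption of this sketch).
    """
    probs = np.asarray(probs, dtype=float)
    v = np.asarray(v, dtype=float)
    # Image-wide average prediction S_bar: mean of the per-pixel distributions.
    s_bar = probs.reshape(-1, probs.shape[-1]).mean(axis=0)
    # Classes with v_k = 0 contribute nothing to the sum (0 * log 0 = 0).
    present = v > 0
    return float(np.sum(v[present] * np.log(v[present] / (s_bar[present] + eps))))

# Toy 2x2 image with K=2: half the pixels are predicted as each class.
probs = np.array([[[1.0, 0.0], [1.0, 0.0]],
                  [[0.0, 1.0], [0.0, 1.0]]])
print(size_loss(probs, [0.5, 0.5]))  # close to zero: average matches the target
print(size_loss(probs, [0.9, 0.1]))  # positive: average deviates from the target
```

The loss depends on the prediction only through the average $\bar{S}$, so it constrains how much of the image each class occupies but not where; in the summary above, that spatial role is left to the CRF regularizer.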

Abstract

This paper demonstrates a surprising result for segmentation with image-level targets: extending binary class tags to approximate relative object-size distributions allows off-the-shelf architectures to solve the segmentation problem. A straightforward zero-avoiding KL-divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel-precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non-reproducible architectural modifications and specialized multi-stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to the size targets' errors. For some classes, the validation accuracy is significantly better than the pixel-level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image-level supervision in segmentation and may encourage other simple general solutions to the problem.

Paper Structure

This paper contains 13 sections, 17 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Semantic segmentation from image-level supervision: test results for training by (a) log-barriers \ref{eq:kolesnikov_expand_loss} and (b) our approximate size targets \ref{eq:size_loss}. Full GT-mask supervision results are in (c).
  • Figure 2: Forward vs reverse KL divergence. Assuming binary classification $K=2$, we can represent all possible probability distributions as points on the interval [0,1]. The solid curves illustrate our "strong" size constraint, i.e. the forward KL-divergence $KL(v\|\bar{S})$ for the average prediction $\bar{S}$. We show two examples of volumetric prior $v_1=(0.9,0.1)$ (blue curve) and $v_2=(0.5,0.5)$ (red curve). For comparison, the dashed curves represent reverse KL divergence $KL(\bar{S}\|v)$ commonly used in the prior art.
  • Figure 3: Segmentation accuracy for our approximate size-target supervision on PASCAL's training and validation data. The segmentation is trained using losses \ref{eq:size_loss} (red) or \ref{eq: our loss1} (blue), where size targets are subject to various levels of corruption (\ref{eq:synthetic_targets}, \ref{eq:synthetic_RE}).
  • Figure 4: Segmentation accuracy for scribble supervision with and without our approximate size targets on PASCAL's validation data.
  • Figure 5: Human annotation quality: relative-error histograms on PASCAL VOC classes (dog, cat, and bird). The histograms are normalized over the image count in each class. The relative size errors \ref{eq:RE} average to $mRE=15.6\%$. For comparison, the dashed line shows the distribution of relative errors for our synthetic corruption of size targets \ref{eq:synthetic_targets} for $\sigma=19.5$, corresponding to the same $mRE=15.6\%$ \ref{eq:synthetic_RE}.
  • ...and 4 more figures
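
The contrast drawn in the Figure 2 caption between forward and reverse KL can be checked numerically for the binary case $K=2$. The sketch below is illustrative only: the helper `kl` and the sampled values of $\bar{S}$ are assumptions for the example, while the prior $v_1=(0.9,0.1)$ is taken from the caption.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length sequences.

    Returns math.inf when q_k = 0 for some class with p_k > 0.
    """
    total = 0.0
    for pk, qk in zip(p, q):
        if pk == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        if qk == 0.0:
            return math.inf
        total += pk * math.log(pk / qk)
    return total

v = (0.9, 0.1)  # volumetric prior v_1 from the Figure 2 caption

# Sweep the average prediction S_bar = (s, 1 - s) toward a collapsed class.
for s in (0.9, 0.99, 0.999, 1.0):
    s_bar = (s, 1.0 - s)
    forward = kl(v, s_bar)   # KL(v || S_bar), the "strong" size constraint
    reverse = kl(s_bar, v)   # KL(S_bar || v), the reverse divergence
    print(f"s={s}: forward={forward:.4f}, reverse={reverse:.4f}")
```

As $\bar{S}$ approaches $(1, 0)$, the forward divergence grows without bound because the second class has a positive target but a vanishing average prediction, whereas the reverse divergence stays finite. This is the zero-avoiding behavior the abstract refers to.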