Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation
Xingye Fan, Zhongwen Zhang, Yuri Boykov
TL;DR
This work introduces approximate size targets as a practical image-level supervision signal for semantic segmentation. By computing the image-wide average prediction $\bar{S}$ and optimizing a forward KL loss $L_{size}=KL(v\|\bar{S})$ in combination with a CRF regularizer, the method achieves segmentation accuracy competitive with full pixel-level supervision on datasets like PASCAL VOC, and remains robust to substantial errors in size targets. The approach integrates with scribble/seeds and extends to human-annotated and medical datasets, outperforming size-barrier baselines and many single-stage methods while matching or surpassing more complex multi-stage approaches. The findings suggest that simple, approximate size information can effectively guide segmentation, reducing annotation burden and enabling robust performance across domains.
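The size loss described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `size_loss`, the `(H, W, K)` array layout, and the `eps` smoothing constant are assumptions made here, and the CRF regularizer is omitted.

```python
import numpy as np

def size_loss(probs, target, eps=1e-8):
    """Sketch of KL(target || average prediction) for a single image.

    probs:  (H, W, K) per-pixel class probabilities (each pixel sums to 1).
    target: (K,) approximate relative class sizes (sums to 1).
    eps:    hypothetical smoothing constant to keep log() finite.
    """
    probs = np.asarray(probs, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Image-wide average prediction: a distribution over the K classes.
    s_bar = probs.reshape(-1, probs.shape[-1]).mean(axis=0)
    # Classes with zero target contribute nothing (0 * log 0 is taken as 0).
    present = target > 0
    log_ratio = np.log(target[present]) - np.log(s_bar[present] + eps)
    return float(np.sum(target[present] * log_ratio))

# Usage: a 4x4 image, 2 classes, every pixel predicting [0.5, 0.5].
uniform = np.full((4, 4, 2), 0.5)
print(size_loss(uniform, [0.5, 0.5]))  # close to 0: average matches the target
print(size_loss(uniform, [1.0, 0.0]))  # close to log(2): average disagrees
```

In a training setup this term would be computed on differentiable network outputs and averaged over a batch; NumPy is used here only to keep the arithmetic visible.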
Abstract
This paper demonstrates a surprising result for segmentation with image-level targets: extending binary class tags to approximate relative object-size distributions allows off-the-shelf architectures to solve the segmentation problem. A straightforward zero-avoiding KL-divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel-precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non-reproducible architectural modifications and specialized multi-stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to errors in the size targets. For some classes, the validation accuracy is significantly better than with pixel-level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image-level supervision in segmentation and may encourage other simple general solutions to the problem.
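To make the term "zero-avoiding" concrete, the loss can be written out explicitly. The notation below is introduced here for illustration and is an assumption about the intended definitions: $v$ is the target size distribution over $K$ classes, $S_p$ is the network's class distribution at pixel $p$, and $\Omega$ is the set of image pixels.

$$\bar{S} = \frac{1}{|\Omega|}\sum_{p\in\Omega} S_p, \qquad L_{size} = KL(v\,\|\,\bar{S}) = \sum_{k=1}^{K} v_k \log\frac{v_k}{\bar{S}_k}$$

If a class has $v_k > 0$ while the average prediction $\bar{S}_k$ approaches zero, the term $v_k \log(v_k/\bar{S}_k)$ grows without bound, so the loss strongly penalizes predictions that ignore a class present in the target. Classes with $v_k = 0$ contribute nothing to the sum.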
