Semantic segmentation with reward
Xie Ting, Ye Huang, Zhilin Liu, Lixin Duan
TL;DR
This work tackles the scarcity of pixel-level labels in semantic segmentation by introducing RSS, a reward-based reinforcement learning framework that trains networks using pixel-level or image-level feedback. Its key mechanisms (SyncAN for synchronized normalization, Progressive Scale Rewards to reduce the action space, and Pair-wise Spatial Difference to provide informative advantages) enable convergence under image-level supervision. Experiments on Pascal Context, VOC2012, and Cityscapes show that image-level RSS can approach the performance of pixel-level supervision and even surpass state-of-the-art weakly supervised methods. The approach has practical potential for training semantic encoders with global feedback in real-world scenarios and could extend to other vision tasks with limited labeled data.
Abstract
In real-world scenarios, pixel-level labeling is not always available. Sometimes we need a semantic segmentation network, or even a visual encoder, that is highly compatible with, and can be trained using, various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we propose RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning to pure semantic segmentation, offered at two levels of granularity (pixel-level and image-level). RSS incorporates several novel techniques, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that RSS ensures the convergence of the semantic segmentation network at both reward levels. Additionally, RSS with an image-level reward outperforms existing weakly supervised methods that also rely solely on image-level signals during training.
