Semantic segmentation with reward

Xie Ting, Ye Huang, Zhilin Liu, Lixin Duan

TL;DR

This work tackles the scarcity of pixel-level labels in semantic segmentation by introducing RSS, a reward-based reinforcement learning framework that trains networks using pixel-level and image-level feedback. It introduces three key mechanisms to enable convergence under image-level supervision: SyncAN for synchronized normalization, Progressive Scale Rewards (PSR) to reduce the action space, and Pair-wise Spatial Difference (PSD) to provide informative advantages. Experiments on Pascal Context, VOC2012, and Cityscapes show that image-level RSS can approach the performance of pixel-level supervision and even surpass state-of-the-art weakly supervised methods. The approach has practical potential for training semantic encoders with global feedback in real-world scenarios and could extend to other vision tasks with limited labeled data.

Abstract

In real-world scenarios, pixel-level labeling is not always available. Sometimes we need a semantic segmentation network, or even a visual encoder, that is highly compatible with, and can be trained using, various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we propose RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning to pure semantic segmentation, offered at two levels of granularity (pixel-level and image-level). RSS incorporates several novel techniques, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that RSS ensures the convergence of the semantic segmentation network at both levels of reward. Additionally, RSS with an image-level reward outperforms existing weakly supervised methods that also rely solely on image-level signals during training.

Paper Structure

This paper contains 31 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Architecture comparison between pixel-level supervised learning and reinforcement learning. Zoom in to see better. CE loss: cross-entropy loss.
  • Figure 2: Overall architecture of image-level reinforcement learning. Zoom in to see better. CE loss: Cross-entropy loss. H: Height. W: Width. $\alpha$: Number of actions.
  • Figure 3: Scaling multi-resolution ($S_{1}$ to $S_{N}$) action maps helps to reduce the action sampling space. Zoom in to see better. AM: Action map. H: Height. W: Width. $\alpha$: Number of actions.
  • Figure 4: Utilizing the Pairwise Spatial Difference (PSD) between action maps $AM_{i}$ and $AM_{j}$, with their relevant score maps $SM_{i}$ and $SM_{j}$ to calculate the advantages of $AM_{i}$ over $AM_{j}$. Zoom in to see better. AM: Action map. SM: Score map.
  • Figure 5: Visualization of RSS (image-level reward) on Pascal Context dataset. Although image-level rewards only provide a global perspective of supervision signals, the performance of the final models is not much worse than that of traditional pixel-level supervision (PSL).
  • ...and 2 more figures
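
The Figure 3 caption states that scaling multi-resolution action maps reduces the action sampling space. The sketch below only illustrates the counting argument behind that claim and is not the authors' implementation: if each of the H × W positions independently takes one of $\alpha$ actions, there are $\alpha^{H \cdot W}$ possible action maps, so sampling at a coarser resolution shrinks the space dramatically. The function names, the halving schedule, the 64 × 64 resolution, and the value 21 for $\alpha$ are all assumptions chosen for illustration.

```python
import math


def log10_action_space(height: int, width: int, num_actions: int) -> float:
    """Return log10 of the number of distinct action maps.

    Assumes every position independently picks one of `num_actions` actions,
    so the count is num_actions ** (height * width).
    """
    return height * width * math.log10(num_actions)


def progressive_scales(height: int, width: int, num_scales: int) -> list[tuple[int, int]]:
    """Hypothetical schedule: halve the resolution per step, listed coarse to fine."""
    scales = [(max(1, height >> k), max(1, width >> k)) for k in range(num_scales)]
    return scales[::-1]


if __name__ == "__main__":
    # Illustrative values only: a 64x64 action map with 21 actions.
    for h, w in progressive_scales(64, 64, 4):
        exponent = log10_action_space(h, w, 21)
        print(f"{h}x{w}: roughly 10^{exponent:.0f} possible action maps")
```

Because the exponent grows with the number of positions, each halving of the resolution divides the exponent by four, which is one plausible reading of why starting from coarse action maps makes sampling tractable.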