Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Yangyang Li, Xuanting Hao, Ronghua Shang, Licheng Jiao

TL;DR

This work introduces a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling.

Abstract

In view of the fact that semi- and self-supervised learning share a fundamental principle, effectively modeling knowledge from unlabeled data, various semi-supervised semantic segmentation methods have integrated representative self-supervised learning paradigms for further regularization. However, the potential of the state-of-the-art generative self-supervised paradigm, masked image modeling, has been scarcely studied. This paradigm learns knowledge by establishing connections between the masked and visible parts of a masked image during the pixel reconstruction process. By inheriting and extending this insight, we successfully leverage masked image modeling to boost semi-supervised semantic segmentation. Specifically, we introduce a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes. In this way, the mask-induced connections are established within each class, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling. To strengthen these intra-class connections, we further develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class. Additionally, in semantic space, we explore the application of masked image modeling to enhance regularization. Extensive experiments conducted on well-known benchmarks demonstrate that our approach achieves state-of-the-art performance. The code will be available at https://github.com/haoxt/S4MIM.

Paper Structure

This paper contains 16 sections, 14 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The illustration of the insight into MIM and the idea of our S4MIM. During the encoding-decoding ($E$-$PiD$) process, MIM learns knowledge through establishing the connections between the features corresponding to the masked and visible parts. Building upon this insight, there are three core components of our idea: Class-wise MIM, Class-wise Mask-induced Feature Aggregation, and MIM in Semantic Space.
  • Figure 2: The overview of our S4MIM. Each iteration of our approach includes two phases. In Phase I, as described in UniMatch, labeled data is supervised by ground truth, while pseudo-labels generated from weakly perturbed unlabeled data guide its strongly perturbed version. In Phase II, constrained by pseudo-labels, masked data features in the pixel decoder $PiD$ are organized by class-wise grouping. These grouped features are then utilized in two branches. One branch decodes the grouped features via independent heads, then sums the outputs for final reconstruction. The other branch aggregates the features of each group by pulling them closer. Meanwhile, at the output of the semantic decoder $SeD$, the masked data is supervised by pseudo-labels derived from the original data.
  • Figure 3: The details of our Class-wise MIM (upper area) and Class-wise Mask-induced Feature Aggregation (lower area) with one masked-data stream. In the upper area, $fea$ is multiplied element-wise with each $\mathcal{Y}_{pse}\langle c \rangle$, producing $\{fea_{c}\}_{c=1}^{C}$. Each $fea_{c}$ is used with its corresponding $Head_{c}$ to recover the pixels for class $c$. By summing these predictions, the entire image is reconstructed. In the lower area, within each $fea_{c}$, available spatial positions in the visible and masked parts are denoted as sets $\Omega_{c}^{v}$ and $\Omega_{c}^{m}$, respectively. Vectors from $\Omega_{c}^{v}$ are used to construct the class prototype $\widetilde{\mathbf{v}}_{c}$ via weighted mean calculation and moving average updating. This prototype supervises the vectors from $\Omega_{c}^{m}$ to aggregate towards it by minimizing the weighted cosine distance.
  • Figure 4: Ablation study on the parameters of masking. (a) Masking ratio. (b) Masking patch size.
  • Figure 5: Visualization of the feature space in $SeD$ on PASCAL VOC 2012 under the 92 partition, using t-SNE. (a) The semi-supervised baseline. (b) Our S4MIM.
  • ...and 2 more figures
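
The two branches described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that each $Head_{c}$ is a bias-free linear (1x1) projection, that the unspecified weights in the "weighted mean" and "weighted cosine distance" are per-position pseudo-label confidences, and that the moving-average momentum is 0.99. The function names `classwise_reconstruct` and `aggregation_loss` and all argument names are hypothetical.

```python
import numpy as np


def classwise_reconstruct(fea, pseudo_label, head_weights):
    """Class-wise reconstruction (upper area of Figure 3), as a sketch.

    fea:          (D, H, W) pixel-decoder features of the masked image
    pseudo_label: (H, W) integer class map with values in [0, C)
    head_weights: (C, O, D) one hypothetical linear head per class
    """
    num_classes, out_ch, _ = head_weights.shape
    recon = np.zeros((out_ch,) + pseudo_label.shape)
    for c in range(num_classes):
        # One-hot slice of the pseudo-label for class c, i.e. Y_pse<c>.
        class_mask = (pseudo_label == c).astype(fea.dtype)
        fea_c = fea * class_mask  # broadcast over the channel axis
        # Head_c recovers pixels for class c; outputs are summed.
        recon += np.einsum("od,dhw->ohw", head_weights[c], fea_c)
    return recon


def aggregation_loss(fea, pseudo_label, patch_mask, weights, prototypes,
                     momentum=0.99, eps=1e-8):
    """Mask-induced feature aggregation (lower area of Figure 3), as a sketch.

    patch_mask: (H, W) bool, True where the position was masked
    weights:    (H, W) assumed per-position confidence weights
    prototypes: (C, D) running class prototypes from earlier iterations
    Returns the weighted cosine-distance loss and the updated prototypes.
    """
    new_protos = prototypes.copy()
    total, norm = 0.0, 0.0
    for c in range(prototypes.shape[0]):
        in_class = pseudo_label == c
        visible = in_class & ~patch_mask  # positions in Omega_c^v
        masked = in_class & patch_mask    # positions in Omega_c^m
        if visible.any():
            w = weights[visible]
            # Weighted mean of visible vectors, then moving-average update.
            mean_vec = (fea[:, visible] * w).sum(axis=1) / (w.sum() + eps)
            new_protos[c] = momentum * prototypes[c] + (1 - momentum) * mean_vec
        if masked.any():
            vecs = fea[:, masked]  # (D, N_masked)
            proto = new_protos[c]
            cos = (proto @ vecs) / (
                np.linalg.norm(proto) * np.linalg.norm(vecs, axis=0) + eps
            )
            # Pull masked vectors toward the prototype of their class.
            total += float((weights[masked] * (1.0 - cos)).sum())
            norm += float(weights[masked].sum())
    return total / (norm + eps), new_protos
```

Because the assumed heads are linear and the class masks are disjoint, each output pixel depends only on the head of its own pseudo-labeled class, which is the sense in which regions are reconstructed independently. In a real training loop the prototype would act as a fixed target with no gradient flowing through it, a detail that plain NumPy does not model.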