SSA-Seg: Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation

Xiaowen Ma, Zhenliang Ni, Xinghao Chen

TL;DR

The coarse masks obtained from the fixed prototypes are employed as a guide to adjust the fixed prototype towards the center of the semantic and spatial domains in the test image, and the adapted prototypes in semantic and spatial domains are simultaneously considered to accomplish classification decisions.

Abstract

Vanilla pixel-level classifiers for semantic segmentation are based on a certain paradigm, involving the inner product of fixed prototypes obtained from the training set and pixel features in the test image. This approach, however, encounters significant limitations, i.e., feature deviation in the semantic domain and information loss in the spatial domain. The former struggles with large intra-class variance among pixel features from different images, while the latter fails to utilize the structured information of semantic objects effectively. This leads to blurred mask boundaries as well as a deficiency of fine-grained recognition capability. In this paper, we propose a novel Semantic and Spatial Adaptive Classifier (SSA-Seg) to address the above challenges. Specifically, we employ the coarse masks obtained from the fixed prototypes as a guide to adjust the fixed prototype towards the center of the semantic and spatial domains in the test image. The adapted prototypes in semantic and spatial domains are then simultaneously considered to accomplish classification decisions. In addition, we propose an online multi-domain distillation learning strategy to improve the adaption process. Experimental results on three publicly available benchmarks show that the proposed SSA-Seg significantly improves the segmentation performance of the baseline models with only a minimal increase in computational cost. Code is available at https://github.com/xwmaxwma/SSA-Seg.
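
To make the "vanilla" paradigm from the abstract concrete, the following is a minimal sketch of a pixel-level classifier that scores each pixel by its inner product with fixed class prototypes. It is an illustration only, not the authors' code (see the repository linked above for that); the function names and the toy data are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
# Illustrative sketch of a vanilla pixel-level classifier (not the authors' code).

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def vanilla_classify(pixel_features, prototypes):
    """Label each pixel with the class whose fixed prototype scores highest."""
    labels = []
    for feature in pixel_features:
        scores = [dot(feature, proto) for proto in prototypes]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Hypothetical toy data: 2 classes, 2-D features, 3 pixels.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]
labels = vanilla_classify(pixels, prototypes)
```

Because the prototypes are the same for every test image, a pixel whose features drift away from its class prototype (as for the third toy pixel, which sits close to the decision boundary) has no way to pull the prototype toward it; this is the feature deviation the abstract refers to.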

Paper Structure (24 sections, 13 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: An example of a vanilla pixel-level classifier, with SeaFormer-L [seaformer] as the baseline and the feature distribution visualized with t-SNE. (a) is a test image from the ADE20K dataset, and (b) shows the feature distribution of (a) in the semantic domain, with purple and gray dots denoting pixel features of the door category and of other categories, respectively. The blue star denotes the fixed prototype of the door category learned on the training set. Vanilla pixel-level classifiers compare pixel features directly with the fixed semantic prototypes, which leads to feature deviation in the semantic domain and information loss in the spatial domain. In contrast, SSA-Seg makes classification decisions based on adaptive semantic and spatial prototypes by prompting the prototypes to shift toward the centers of the semantic and spatial domains, as shown in (c) and (d). A visual comparison of the baseline and SSA-Seg can be found in Fig. 5.
  • Figure 2: SSA-Seg overview. For the semantic features $\mathcal{S}_f$ output by the backbone and decode head, we first generate spatial features $\mathcal{P}_f$ by position encoding. We then retain the original $1 \times 1$ convolution to generate the coarse mask $\mathcal{M}_c$. Guided by $\mathcal{M}_c$, we compute the centers of the semantic and spatial domains in the pre-classified representations and fuse them with the fixed semantic prototypes $\mathcal{S}$ and the prototype position basis $\mathcal{P}$ to generate the semantic prototypes $\mathcal{S}_p$ and the spatial prototypes $\mathcal{P}_p$. Finally, we consider the semantic and spatial prototypes simultaneously to make classification decisions. The right part of the figure shows an online teacher classifier used only during training, in which the coarse mask is replaced with the ground-truth mask; it constrains the prototype adaption and transfers accurate semantic and spatial knowledge to the primary classifier through multi-domain distillation learning.
  • Figure 3: Visualization of the inter-class relation matrices for the semantic prototypes $\mathcal{S}_p$ and $\hat{\mathcal{S}}_p$; the latter possesses better inter-class separability. This motivates us to add a semantic domain distillation loss to constrain the adaption of the semantic prototypes. The results show that after semantic domain distillation, the semantic prototypes have better separability, which facilitates category recognition.
  • Figure 4: Validation-set mIoU on (a) ADE20K and (b) COCO-Stuff-10K over training iterations.
  • Figure 5: Comparison of SSA-Seg and baseline (SeaFormer-L) results. Purple and gray indicate pixel features of the door category and of other categories, respectively. The orange star indicates the initial fixed prototype of the wall category, and the red star indicates the adapted semantic prototype.
  • ...and 3 more figures
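
The adaptation pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: normalizing the coarse mask with a softmax over pixels, fusing each domain center with its fixed prototype by a convex combination with weight `alpha`, and adding the semantic and spatial scores are all choices made here for brevity. The function names, the 1-D stand-in for position encodings, and the toy data are hypothetical, and the training-only teacher classifier and distillation losses are left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean(vectors, weights):
    """Weighted average of vectors; weights are assumed to sum to 1."""
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]

def fuse(fixed, center, alpha):
    """Assumed fusion rule: move the fixed prototype toward the domain center."""
    return [(1 - alpha) * a + alpha * c for a, c in zip(fixed, center)]

def ssa_classify(sem_feats, pos_feats, sem_protos, pos_basis, alpha=0.5):
    n_classes = len(sem_protos)
    # Coarse mask: one score per (class, pixel) from the fixed prototypes.
    coarse = [[dot(f, p) for f in sem_feats] for p in sem_protos]
    sem_adapted, pos_adapted = [], []
    for k in range(n_classes):
        w = softmax(coarse[k])  # per-class weights over pixels
        sem_adapted.append(fuse(sem_protos[k], weighted_mean(sem_feats, w), alpha))
        pos_adapted.append(fuse(pos_basis[k], weighted_mean(pos_feats, w), alpha))
    # Final decision uses semantic and spatial prototypes together.
    labels = []
    for f, q in zip(sem_feats, pos_feats):
        scores = [dot(f, sem_adapted[k]) + dot(q, pos_adapted[k])
                  for k in range(n_classes)]
        labels.append(max(range(n_classes), key=scores.__getitem__))
    return labels, sem_adapted, pos_adapted

# Hypothetical toy data: 4 pixels, 2 classes, 2-D semantic and 1-D spatial features.
sem_feats = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.4, 1.6]]
pos_feats = [[0.0], [0.1], [0.9], [1.0]]
sem_protos = [[1.0, 0.0], [0.0, 1.0]]
pos_basis = [[0.0], [0.0]]
labels, sem_adapted, pos_adapted = ssa_classify(
    sem_feats, pos_feats, sem_protos, pos_basis)
```

With `alpha=0` the adapted prototypes equal the fixed ones, so the sketch reduces to the vanilla classifier; larger values let each test image pull the prototypes toward its own semantic and spatial centers.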