Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

Xiangheng Shan; Dongyue Wu; Guilin Zhu; Yuanjie Shao; Nong Sang; Changxin Gao

TL;DR

A novel framework for open-vocabulary semantic segmentation called EBSeg is proposed, incorporating an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss) to learn a consistent semantic structure from CLIP.

Abstract

Open-vocabulary semantic segmentation is a challenging task that requires the model to output semantic masks of an image beyond a closed-set vocabulary. Although many efforts have been made to utilize powerful CLIP models for this task, they still easily overfit to training classes because of the natural gaps in semantic information between training and new classes. To overcome this challenge, we propose a novel framework for open-vocabulary semantic segmentation called EBSeg, incorporating an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder is designed to generate different image embeddings for both training and new classes. Subsequently, these two types of embeddings are adaptively balanced to fully exploit their ability to recognize training classes and to generalize to new classes. To learn a consistent semantic structure from CLIP, the SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP, thereby improving the generalization ability of our model. Furthermore, we employ a frozen SAM image encoder to complement the spatial information that CLIP features lack due to the low training image resolution and image-level supervision inherent in CLIP. Extensive experiments conducted across various benchmarks demonstrate that the proposed EBSeg outperforms the state-of-the-art methods. Our code and trained models will be available at https://github.com/slonetime/EBSeg.
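The idea behind the SSC Loss, aligning inter-class affinity between the image and text feature spaces, can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine_affinity` and `ssc_loss_sketch`, the use of cosine similarity as the affinity measure, and the L1 penalty are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math


def cosine_affinity(embeddings):
    """Return the pairwise cosine-similarity matrix of a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in embeddings]
    n = len(embeddings)
    return [
        [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def ssc_loss_sketch(image_embeddings, text_embeddings):
    """Mean absolute gap between image-side and text-side class affinities.

    Assumptions (not taken from the paper): row k of both inputs refers to
    the same class, affinity is cosine similarity, and the penalty is L1.
    """
    img_aff = cosine_affinity(image_embeddings)
    txt_aff = cosine_affinity(text_embeddings)
    n = len(img_aff)
    total = sum(
        abs(img_aff[i][j] - txt_aff[i][j]) for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy data: two classes whose text embeddings are orthogonal.
text = [[1.0, 0.0], [0.0, 1.0]]
# Image embeddings with the same pairwise structure (rotated and rescaled).
aligned_images = [[0.0, 2.0], [-2.0, 0.0]]
# Image embeddings that collapse both classes onto one direction.
collapsed_images = [[1.0, 0.0], [1.0, 0.0]]

loss_aligned = ssc_loss_sketch(aligned_images, text)
loss_collapsed = ssc_loss_sketch(collapsed_images, text)
```

In this toy setting the penalty is zero when the image embeddings preserve the pairwise structure of the text embeddings, even though the vectors themselves differ, and it grows when distinct classes collapse together. That is the sense in which such a loss constrains semantic structure rather than individual embeddings.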

Paper Structure (20 sections, 20 equations, 8 figures, 10 tables)

Introduction
Related works
Method
Method Overview
Image Feature Extraction and Fusion
AdaB Decoder
SSC Loss
Adaptively Balancing and Inference
Experiments
Experiment Setup
Comparison with State-of-the-Art methods
Ablation Studies
Conclusion
Additional Experiments
Embedding Balancing Strategy and Weight
...and 5 more sections

Figures (8)

Figure 1: Illustration of our main idea. (a) Our AdaB Decoder (Adaptively Balanced Decoder) outputs image embeddings for both training classes (test-time classes that also appear in the training set) and new classes (test-time classes absent from the training set). By adaptively balancing these embeddings, our model performs better on both training and new classes. (b) We propose the SSC Loss (Semantic Structure Consistency loss), which aligns the distribution of the image embeddings with that of the text embeddings. The SSC Loss helps our model learn the semantic structure of CLIP better and achieve better generalization capability for new classes.
Figure 2: The architecture of our model EBSeg. We first obtain image features from two frozen image encoders and fuse them in a feature fusion module. After that, the fused features are input into our AdaB Decoder, which outputs masks $\mathbf{M}$ and image embeddings (including mask attention embeddings $\mathbf{B}$, fully supervised embeddings $\mathbf{A}$ and frozen embeddings $\mathbf{D}$). During training, we apply the SSC Loss to learn a consistent semantic structure from CLIP. During inference, we adaptively balance the three embeddings output by AdaB Decoder and obtain semantic segmentation results with the masks, balanced image embeddings, and text embeddings.
Figure 3: Detailed structure of AdaB Decoder. We first input fused image features into the Pixel Decoder. The outputs of the first three stages are then fed to the Transformer Decoder which outputs image embeddings $\mathbf{A}$ and $\mathbf{A}^{'}$. Then we obtain masks $\mathbf{M}$ with $\mathbf{A}$ and the largest feature map $\mathbf{F}_{1}^{'}$ from the Pixel Decoder. We obtain per-head attention masks $\mathbf{M}_{attn}$ with per-head embeddings $\mathbf{A}^{'}$ and $\mathbf{F}_{1}^{'}$. Finally, we perform masked self-attention in the last few blocks of CLIP image encoder with $\mathbf{M}_{attn}$ to get mask attention embeddings $\mathbf{B}$.
Figure 4: Visualization examples of our model on ADE20K-150 validation set.
Figure 5: Qualitative results on the ADE20K-150 [zhou2017scene] validation set. We compare our approach with two other methods, OVSeg [liang2023open] and SAN [xu2023side]. Thanks to our AdaB Decoder and SSC Loss, our model shows a stronger generalization ability for new classes that do not exist in the training dataset COCO-Stuff [caesar2018coco], such as hovel in the third row and animal in the last row. Moreover, with the help of our AdaB Decoder, our model better recognizes training classes that exist in the training set, such as building in the first row and table and wall in the second row.
...and 3 more figures
