SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

TL;DR

SED is a simple open-vocabulary semantic segmentation framework that combines hierarchical encoder-based cost map generation with a gradual fusion decoder and a category early rejection mechanism. A hierarchical ConvNeXt backbone generates pixel-level image-text cost maps, and a top-down gradual fusion decoder merges the cost map with multi-level backbone features, so SED preserves local spatial detail while keeping computational cost linear in input size. The category early rejection strategy accelerates inference by pruning non-existing categories at early decoder layers, achieving up to 4.7× speed-up with minimal accuracy loss. Across multiple datasets, including ADE20K and PC-459, SED attains state-of-the-art or competitive mIoU while delivering practical inference times on a single GPU. The approach highlights the benefits of hierarchical backbones and cost-map-based decoding for open-vocabulary segmentation.

Abstract

Open-vocabulary semantic segmentation aims to assign each pixel to a semantic group drawn from an open set of categories. Most existing methods build on pre-trained vision-language models, where the key is to adapt an image-level model to the pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, which comprises hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs a hierarchical backbone, instead of a plain transformer, to predict a pixel-level image-text cost map. Compared to a plain transformer, a hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine the cost map with the feature maps of different backbone levels for segmentation. To accelerate inference, we introduce a category early rejection scheme in the decoder that rejects many non-existing categories at early decoder layers, resulting in at most 4.7 times acceleration without accuracy degradation. Experiments on multiple open-vocabulary semantic segmentation datasets demonstrate the efficacy of our SED method. When using ConvNeXt-B, our SED method achieves an mIoU score of 31.6% on ADE20K with 150 categories at 82 milliseconds (ms) per image on a single A6000. We will release it at https://github.com/xb534/SED.git.
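
To make the idea of a pixel-level image-text cost map concrete, here is a minimal NumPy sketch. It assumes the cost map is the cosine similarity between each pixel embedding and each category text embedding; the abstract does not spell out the similarity measure, so that choice, the function name `cost_map`, and the array shapes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def cost_map(pixel_emb, text_emb, eps=1e-8):
    """Illustrative image-text cost map (assumed to be cosine similarity).

    pixel_emb: (H, W, D) dense image embeddings from the image encoder.
    text_emb:  (N, D) one embedding per category name from the text encoder.
    Returns an (H, W, N) array: one similarity score per pixel and category.
    """
    # L2-normalise both sides so the dot product becomes a cosine similarity.
    p = pixel_emb / (np.linalg.norm(pixel_emb, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    # Contract the shared embedding dimension D.
    return np.einsum("hwd,nd->hwn", p, t)
```

The last axis of the result grows with the number of categories, which is why pruning categories inside the decoder can reduce inference time.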

Paper Structure (15 sections, 1 equation, 5 figures, 7 tables)

This paper contains 15 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Accuracy (mIoU) and speed (ms) comparison on A-150 and PC-459. Speed is reported on a single NVIDIA A6000 GPU. Our proposed SED achieves an optimal trade-off between speed and accuracy compared to existing methods in the literature: SAN, CAT-Seg, OVSeg, DeOP, SimBaseline, and ZegFormer.
  • Figure 2: Overall architecture of our proposed SED. We first employ a hierarchical encoder (learnable) and a text encoder (frozen) to generate a pixel-level image-text cost map. Afterwards, we introduce a gradual fusion decoder to combine the different feature maps of the hierarchical encoder with the cost map. The gradual fusion decoder stacks feature aggregation modules (FAM) and skip-layer fusion modules (SFM). In addition, we design category early rejection (CER) in the decoder to accelerate inference without sacrificing performance.
  • Figure 3: Structure of the gradual fusion decoder. The gradual fusion decoder (GFD) first performs feature aggregation (a) at both the spatial and class levels, and then employs skip-layer fusion (b) to combine the feature maps from the previous decoder layer and the hierarchical encoder.
  • Figure 4: Structure of category early rejection. During training (a), we attach an auxiliary convolution after each decoder layer to predict segmentation maps supervised by the ground truth. During inference (b), we employ a top-$k$ strategy to predict existing categories and reject non-existing categories for the next decoder layer.
  • Figure 5: Qualitative results. On the left, we show some high-quality results, where our method accurately classifies and segments various categories. At the top right, we give some failure cases and the corresponding ground truths (GT). At the bottom right, we give one case in which our method segments a cat that is not annotated in the ground truth (GT).
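
The inference-time top-$k$ rejection described in the Figure 4 caption can be sketched roughly as follows. This is a hedged illustration, not the released code: it assumes each pixel votes for its $k$ highest-scoring categories from the auxiliary prediction and that the union of those votes is what the next decoder layer keeps. The function name `early_reject`, the default `k=8`, and the `(H, W, N, C)` feature layout are all hypothetical.

```python
import numpy as np

def early_reject(aux_logits, feats, k=8):
    """Keep only categories that some pixel ranks in its top-k (assumed rule).

    aux_logits: (H, W, N) per-pixel category scores from an auxiliary head.
    feats:      (H, W, N, C) per-category decoder features (hypothetical layout).
    Returns (keep, pruned): sorted kept category indices and the features
    restricted to those categories, with shape (H, W, len(keep), C).
    """
    n = aux_logits.shape[-1]
    k = min(k, n)
    # Indices of the k largest scores at every pixel; their internal order
    # does not matter because only the union over pixels is used.
    topk = np.argpartition(aux_logits, n - k, axis=-1)[..., n - k:]
    # Union over all pixels; every other category is rejected.
    keep = np.unique(topk)
    return keep, feats[:, :, keep, :]
```

Under these assumptions a smaller `k` rejects more categories and leaves less work for later layers, at the risk of discarding a category that is actually present.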