Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Lin Chen, Qi Yang, Kun Ding, Zhihao Li, Gang Shen, Fei Li, Qiyuan Cao, Shiming Xiang

TL;DR

This work targets two sources of redundancy in cost-based open-vocabulary semantic segmentation (OVSS): redundant class channels introduced when cost maps are built, and inefficient sequence modeling during cost aggregation. It introduces ERR-Seg, which pairs two components. Redundancy-Reduced Hierarchical Cost maps (RRHC) build a compact per-image class vocabulary and draw hierarchical cost maps from multiple layers. Redundancy-Reduced Cost Aggregation (RRCA) shortens the spatial and class sequences before attention. On ADE20K-847, ERR-Seg improves on previous state-of-the-art methods by $5.6\%$ with a $3.1\times$ speedup, and it reports lower latency with improved or competitive mIoU on the other benchmarks. The analyses indicate that reducing redundancy sharpens both spatial-level and class-level contextual modeling while significantly cutting computation, which makes open-vocabulary segmentation more practical for real-world use. The results position ERR-Seg as a scalable framework for dense open-vocabulary understanding and a basis for future integration with stronger vision backbones and multimodal reasoning models.

Abstract

Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. While large-scale vision-language models have shown remarkable open-vocabulary capabilities, their image-level pretraining limits effectiveness on pixel-wise dense prediction tasks like OVSS. Recent cost-based methods narrow this granularity gap by constructing pixel-text cost maps and refining them via cost aggregation mechanisms. Despite achieving promising performance, these approaches suffer from high computational costs and long inference latency. In this paper, we identify two major sources of redundancy in the cost-based OVSS framework: redundant information introduced during cost map construction and inefficient sequence modeling in cost aggregation. To address these issues, we propose ERR-Seg, an efficient architecture that incorporates Redundancy-Reduced Hierarchical Cost maps (RRHC) and Redundancy-Reduced Cost Aggregation (RRCA). Specifically, RRHC reduces redundant class channels by customizing a compact class vocabulary for each image and integrates hierarchical cost maps to enrich semantic representation. RRCA alleviates computational burden by performing both spatial-level and class-level sequence reduction before aggregation. Overall, ERR-Seg results in a lightweight structure for OVSS, characterized by substantial memory and computational savings without compromising accuracy. Compared to previous state-of-the-art methods on the ADE20K-847 benchmark, ERR-Seg improves performance by $5.6\%$ while achieving a $3.1\times$ speedup.
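
To make the notion of a pixel-text cost map concrete, the sketch below scores every pixel embedding against every class text embedding. It is a minimal pure-Python illustration, not the ERR-Seg implementation: cosine similarity is assumed as the similarity measure (the abstract does not specify one), and the names `l2_normalize` and `build_cost_map`, the nested-list layout, and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_cost_map(pixel_embeds, text_embeds):
    """Return an N x H x W cost map of cosine similarities.

    pixel_embeds: H x W x D nested lists (one D-dim embedding per pixel).
    text_embeds:  N x D nested lists (one D-dim embedding per class name).
    Cosine similarity is an assumption made for this sketch.
    """
    texts = [l2_normalize(t) for t in text_embeds]
    pixels = [[l2_normalize(p) for p in row] for row in pixel_embeds]
    return [
        [[sum(a * b for a, b in zip(p, t)) for p in row] for row in pixels]
        for t in texts
    ]


# Toy input: a 1 x 2 image with 2-dim embeddings and 3 candidate classes.
pixel_embeds = [[[1.0, 0.0], [0.0, 2.0]]]
text_embeds = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cost = build_cost_map(pixel_embeds, text_embeds)
```

The result has one H x W map per class, so its size grows linearly with the vocabulary. This is the class-channel redundancy that the abstract attributes to cost map construction: most classes in a large vocabulary do not appear in any given image.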

Paper Structure

This paper contains 40 sections, 1 theorem, 20 equations, 9 figures, 14 tables.

Key Result

Proposition 1

The relationship between the contribution ratio of $\mathcal{Q}_r$, denoted as $\Delta_r$, and the contribution ratio of $\mathcal{Q}_p$, denoted as $\Delta_p$, is expressed as:

Figures (9)

  • Figure 1: Performance vs. latency on ADE20K-847. Compared with ZegFormer [ding2022decoupling], OVSeg [liang2023open], DeOP [han2023open], SAN [xu2023side], SED [xie2024sed], and CAT-Seg [cho2024cat], ERR-Seg achieves a new state-of-the-art with lower latency.
  • Figure 2: Visual comparison of attention maps during cost aggregation. (a) Using our proposed cost maps (RRHC) with redundancy reduction (48 classes) and (b) using the original full cost maps (847 classes). The results show that the attention mechanism more effectively captures spatial-level long-range dependencies in our proposed RRHC.
  • Figure 3: Overall architecture of ERR-Seg. Initially, redundancy-reduced hierarchical cost maps are generated by extracting cost maps from middle-layer features and eliminating class redundancy. Subsequently, the sequence length is reduced before cost aggregation to speed up the computation. Finally, the upsampling decoder restores the high-rank information of cost maps by incorporating image details from the middle-layer features of CLIP's visual encoder.
  • Figure 4: Pipeline of our proposed redundant class reduction mechanism. It involves a training-free scoring function to assign scores to each class, retaining the top-$P$ classes while eliminating other redundant classes.
  • Figure 5: Visualization of segmentation results on ADE20K-150 and ADE20K-847. The results of previous state-of-the-art methods SAN [xie2024sed] and CAT-Seg [cho2024cat] are also included for comparison.
  • ...and 4 more figures
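
The redundant class reduction described in the Figure 4 caption (score each class without training, keep the top-$P$) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the caption does not give the scoring function, so the sketch assumes a class's score is its maximum cost over all pixels, and the names `select_top_classes` and `predict` are hypothetical.

```python
def select_top_classes(cost, top_p):
    """Keep the top_p highest-scoring classes of an N x H x W cost map.

    Assumed scoring rule (not from the paper): the score of a class is
    its maximum cost over all pixels. Returns the kept class indices in
    ascending order and the reduced cost map restricted to them.
    """
    scores = [max(max(row) for row in class_map) for class_map in cost]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:top_p])
    return kept, [cost[i] for i in kept]


def predict(kept, reduced_cost):
    """Per-pixel argmax over the reduced map, mapped back to original class ids."""
    height, width = len(reduced_cost[0]), len(reduced_cost[0][0])
    return [
        [
            kept[max(range(len(kept)), key=lambda k: reduced_cost[k][y][x])]
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy cost map: 4 candidate classes on a 1 x 2 image.
toy_cost = [
    [[0.90, 0.10]],  # class 0: strong response at pixel 0
    [[0.20, 0.30]],  # class 1: weak everywhere
    [[0.10, 0.80]],  # class 2: strong response at pixel 1
    [[0.00, 0.05]],  # class 3: weak everywhere
]
kept, reduced = select_top_classes(toy_cost, top_p=2)
labels = predict(kept, reduced)
```

In this toy case classes 0 and 2 survive, and every later stage operates on 2 class channels instead of 4. At the scale quoted in the Figure 2 caption (48 retained classes out of 847), the class dimension of the cost map shrinks by a factor of roughly 17.6.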

Theorems & Definitions (2)

  • Proposition 1
  • proof