PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation

Jianjian Yin, Tao Chen, Yi Chen, Gensheng Pei, Xiangbo Shu, Yazhou Yao, Fumin Shen

Abstract

Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.
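The coefficient mapper described in the abstract learns pixel-specific weights for the features produced by the experts. A minimal sketch of that fusion step follows. The softmax normalization, the tensor layout, and the names `fuse_expert_features`, `expert_feats`, and `logits` are illustrative assumptions, not the authors' implementation; the abstract only states that per-pixel weights are learned and used to integrate the expert features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_expert_features(expert_feats, logits):
    """Blend K expert feature maps with per-pixel weights.

    expert_feats: (K, H, W, C) array, one feature map per expert (hypothetical layout).
    logits:       (H, W, K) array of unnormalized per-pixel expert scores
                  (assumed to come from a learned mapper).
    Returns an (H, W, C) fused feature map.
    """
    weights = softmax(logits, axis=-1)  # (H, W, K), sums to 1 at every pixel
    return np.einsum("hwk,khwc->hwc", weights, expert_feats)

# Toy example with random inputs standing in for learned quantities.
rng = np.random.default_rng(0)
K, H, W, C = 3, 4, 4, 8
expert_feats = rng.normal(size=(K, H, W, C))
logits = rng.normal(size=(H, W, K))
fused = fuse_expert_features(expert_feats, logits)
```

With all-zero `logits`, every expert receives the same weight and the result reduces to the plain average of the expert features, which is a convenient sanity check for the weighting logic.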

Paper Structure (12 sections, 10 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Motivation of PCA-Seg. (a) Cost aggregation comparison among different methods for the class "runway". (b) The prevailing serial architecture. (c) Our proposed parallel cost aggregation paradigm. The FOD strategy decouples class-level semantics and spatial context, enabling the EPL module to parse diverse knowledge from the resulting orthogonalized features.
  • Figure 2: Comparison with other state-of-the-art methods on different benchmarks. "Ppart" and "Apart" are abbreviations for the Pascal-Part-116 [wei2023ov] and ADE20K-Part-234 [wei2023ov] datasets, respectively. "P" and "O" denote two different settings, namely Pred-All and Oracle-Obj. "H" represents the harmonic IoU over both seen and unseen classes, while "U" denotes the mIoU of unseen classes.
  • Figure 3: The framework of PCA-Seg on OVSS. Image and text features from CLIP’s visual and text encoders are combined via the Hadamard product to construct the cost volume $\mathcal{S}$. This volume is then processed simultaneously through spatial and class aggregation to produce the spatial context feature $\mathcal{B}_{n}$ and the class-level semantic representation $\mathcal{E}_{n}$. The multi-expert (ME-) parser extracts diverse and complementary knowledge from these two streams across multiple perspectives, while the coefficient (Co-) mapper adaptively learns weights to integrate the features parsed by the experts. $\mathcal{B}_{n}$ and $\mathcal{E}_{n}$ are decoupled using the feature orthogonalization decoupling (FOD) strategy to reduce redundancy, thereby providing enriched representations for expert-driven perceptual learning (EPL).
  • Figure 4: Comparison of qualitative results with serial cost aggregation-based state-of-the-art methods at different granularities. (a)–(d) illustrate open-vocabulary semantic segmentation on the PC-59 dataset, while (e)–(h) present open-vocabulary part segmentation on the Pascal-Part-116 benchmark. See §\ref{Compare_method} for details.
  • Figure 5: Redundancy analysis of feature knowledge among different expert blocks. See §\ref{abla_studies} for details.
  • ...and 1 more figure
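
The Figure 3 caption describes two steps that a short sketch can make concrete: building the cost volume $\mathcal{S}$ from image and text features via the Hadamard product, and reducing redundancy between the spatial stream $\mathcal{B}_{n}$ and the class stream $\mathcal{E}_{n}$. The code below is an illustration under stated assumptions. The tensor shapes, the function names `cost_volume` and `redundancy`, and the use of mean squared cosine similarity as a redundancy measure are all hypothetical; the captions do not specify the exact form of the FOD objective.

```python
import numpy as np

def cost_volume(img_feats, txt_feats):
    """Element-wise (Hadamard) product of every pixel embedding with every class embedding.

    img_feats: (H, W, D) image embeddings (assumed layout).
    txt_feats: (N, D) text embeddings, one per class name.
    Returns an (H, W, N, D) volume.
    """
    return img_feats[:, :, None, :] * txt_feats[None, None, :, :]

def redundancy(b, e, eps=1e-8):
    """Mean squared cosine similarity between two (H, W, C) feature maps.

    Assumed stand-in for an orthogonality measure: 0 when the two streams are
    orthogonal at every pixel, close to 1 when they are parallel.
    """
    num = (b * e).sum(axis=-1)
    den = np.linalg.norm(b, axis=-1) * np.linalg.norm(e, axis=-1) + eps
    return float(np.mean((num / den) ** 2))

# Toy inputs standing in for CLIP features.
rng = np.random.default_rng(0)
img = rng.normal(size=(2, 3, 5))
txt = rng.normal(size=(4, 5))
S = cost_volume(img, txt)  # shape (2, 3, 4, 5)
```

Summing `S` over its last axis recovers the ordinary dot-product similarity between each pixel and each class, so the Hadamard form keeps the per-channel detail that a scalar similarity would collapse.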