CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim

TL;DR

This work addresses open-vocabulary semantic segmentation by leveraging a cost-aggregation paradigm that converts image-text relations from CLIP into a dense, multi-modal cost volume. CAT-Seg refines this cost volume through spatial and class-wise aggregation using Swin Transformer blocks and permutation-invariant class interactions, guided by embedding information and an efficient upsampling decoder. By fine-tuning CLIP encoders with carefully chosen strategies and avoiding heavy region-based proposals, CAT-Seg delivers state-of-the-art performance on standard benchmarks and strong cross-domain generalization on the MESS benchmark, while maintaining practical efficiency. The approach demonstrates robust handling of unseen classes, reduced overfitting, and applicability across diverse domains, highlighting the viability of cost-volume reasoning for open-vocabulary segmentation.

Abstract

Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. By aggregating the cosine similarity scores between image and text embeddings, i.e., the cost volume, our method effectively adapts CLIP for segmenting both seen and unseen classes by fine-tuning its encoders, addressing the difficulty existing methods have with unseen classes. Building upon this, we explore methods to effectively aggregate the cost volume, taking into account its multi-modal nature, as it is established between image and text embeddings. Furthermore, we examine various methods for efficiently fine-tuning CLIP.
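
To make the term "cost volume" concrete, the following is a minimal sketch of computing cosine similarity between per-pixel image embeddings and per-class text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `cost_volume`, `argmax_labels`), the nested-list layout, and the toy embeddings are all hypothetical, and the final argmax stands in for the aggregation and decoding stages, which are not shown here. In practice the embeddings would come from the CLIP image and text encoders, and an array library would replace the nested lists.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def cost_volume(image_emb, text_emb):
    """Build an H x W x N cost volume.

    image_emb: H x W x D nested lists (one D-dim embedding per spatial location).
    text_emb:  N x D nested lists (one D-dim embedding per class name).
    """
    return [
        [[cosine(pixel, text) for text in text_emb] for pixel in row]
        for row in image_emb
    ]


def argmax_labels(cost):
    """Pick the highest-scoring class index at every location (illustration only)."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in row]
        for row in cost
    ]


# Toy input (assumed values): a 1 x 2 "image" with 2-dim embeddings and two classes.
image_emb = [[[1.0, 0.0], [0.0, 2.0]]]
text_emb = [[3.0, 0.0], [0.0, 1.0]]

cost = cost_volume(image_emb, text_emb)
labels = argmax_labels(cost)
```

Because cosine similarity normalizes both vectors, the scores depend only on the angle between image and text embeddings, so the differing magnitudes in the toy input do not affect which class wins at each location.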

Paper Structure (41 sections, 4 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: Comparison between feature and cost aggregation for the open-vocabulary semantic segmentation task. In contrast to feature aggregation, which suffers from severe overfitting to seen classes, cost aggregation can generalize to unseen classes and achieves significant performance improvements upon fine-tuning of CLIP.
  • Figure 2: Visualization of the cost volume. We visualize the raw cost volume obtained from frozen CLIP in (a) and fine-tuned CLIP in (b), and the aggregated cost in (c) through CAT-Seg. The top row corresponds to the seen class "chair" and the bottom row corresponds to the unseen class "sofa".
  • Figure 3: Overview of CAT-Seg. Our cost aggregation framework consists of spatial aggregation and class aggregation, followed by an upsampling decoder. Please refer to the supplementary material for a detailed illustration.
  • Figure 4: Qualitative comparison to SAN [xu2023side]. We visualize results on the PC-459 dataset in (a-c). For (d-f), we visualize results from the MESS benchmark [blumenstiel2023mess] across three domains: underwater (top), human parts (middle), and agriculture (bottom).
  • Figure 5: Qualitative comparison between feature and cost aggregation. Our approach (d) successfully segments previously unseen classes, such as "birdcage," whereas approach (c) fails.
  • ...and 8 more figures