CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Seokju Cho; Heeseong Shin; Sunghwan Hong; Anurag Arnab; Paul Hongsuck Seo; Seungryong Kim

TL;DR

This work addresses open-vocabulary semantic segmentation by leveraging a cost-aggregation paradigm that converts image-text relations from CLIP into a dense, multi-modal cost volume. CAT-Seg refines this cost volume through spatial and class-wise aggregation using Swin Transformer blocks and permutation-invariant class interactions, guided by embedding information and an efficient upsampling decoder. By fine-tuning CLIP encoders with carefully chosen strategies and avoiding heavy region-based proposals, CAT-Seg delivers state-of-the-art performance on standard benchmarks and strong cross-domain generalization (MESS), while maintaining practical efficiency. The approach demonstrates robust handling of unseen classes, reduced overfitting, and applicability across diverse domains, highlighting the viability of cost-volume reasoning for open-vocabulary segmentation.

Abstract

Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. Through aggregating the cosine similarity score, i.e., the cost volume between image and text embeddings, our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders, addressing the challenges faced by existing methods in handling unseen classes. Building upon this, we explore methods to effectively aggregate the cost volume considering its multi-modal nature of being established between image and text embeddings. Furthermore, we examine various methods for efficiently fine-tuning CLIP.
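The cost volume described in the abstract can be illustrated with a minimal sketch. It assumes dense image embeddings on an H x W grid and one text embedding per class, all of dimension d; the function names (`l2_normalize`, `cost_volume`, `argmax_labels`) and the toy inputs are hypothetical and do not come from the paper's code. The sketch covers only the raw cosine-similarity step, not the spatial and class aggregation that CAT-Seg applies afterwards.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cost_volume(image_embeds, text_embeds):
    """Cosine similarity between every pixel embedding and every class embedding.

    image_embeds: H x W x d nested lists (assumed dense image embeddings).
    text_embeds:  N x d nested lists (assumed one embedding per class name).
    Returns an H x W x N nested list of cosine similarities.
    """
    text_unit = [l2_normalize(t) for t in text_embeds]
    volume = []
    for row in image_embeds:
        out_row = []
        for pixel in row:
            p = l2_normalize(pixel)
            out_row.append([sum(a * b for a, b in zip(p, t)) for t in text_unit])
        volume.append(out_row)
    return volume


def argmax_labels(volume):
    """Pick the highest-scoring class per pixel from the raw cost volume."""
    return [[max(range(len(scores)), key=scores.__getitem__) for scores in row]
            for row in volume]


# Toy example: a 1 x 2 image with d = 2 and two classes.
image = [[[1.0, 0.0], [0.0, 2.0]]]
texts = [[2.0, 0.0], [0.0, 1.0]]
volume = cost_volume(image, texts)
labels = argmax_labels(volume)
```

Taking the per-pixel argmax of the raw volume is only a baseline for illustration; in the method summarized above, the cost volume is refined by aggregation before any prediction is made.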

Paper Structure (41 sections, 4 equations, 13 figures, 12 tables)

This paper contains 41 sections, 4 equations, 13 figures, 12 tables.

Introduction
Related Work
Open-vocabulary semantic segmentation.
Fine-tuning vision-language models.
Cost aggregation.
Methodology
Cost Computation and Embedding
Spatial Cost Aggregation
Class Cost Aggregation
CAT-Seg Framework
Upsampling decoder.
Embedding guidance.
Efficient fine-tuning of CLIP
Experiments
Datasets and Evaluation
...and 26 more sections

Figures (13)

Figure 1: Comparison between feature and cost aggregation for open-vocabulary semantic segmentation task. In contrast to feature aggregation suffering severe overfitting to seen classes, cost aggregation can generalize to unseen classes and achieve significant performance improvements upon fine-tuning of CLIP.
Figure 2: Visualization of the cost volume. We visualize the raw cost volume obtained from frozen CLIP in (a) and fine-tuned CLIP in (b), and the aggregated cost in (c) through CAT-Seg. The top row corresponds to the seen class "chair" and the bottom row corresponds to the unseen class "sofa".
Figure 3: Overview of CAT-Seg. Our cost aggregation framework consists of spatial aggregation and class aggregation, followed by an upsampling decoder. Please refer to the supplementary material for a detailed illustration.
Figure 4: Qualitative comparison to SAN [xu2023side]. We visualize results on the PC-459 dataset in (a-c). For (d-f), we visualize results from the MESS benchmark [blumenstiel2023mess] across three domains: underwater (top), human parts (middle), and agriculture (bottom).
Figure 5: Qualitative comparison between feature and cost aggregation. Our approach (d) successfully segments previously unseen classes, such as "birdcage," whereas approach (c) fails.
...and 8 more figures
