AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Ya Zhang, Yanfeng Wang

TL;DR

This work tackles open-vocabulary semantic segmentation under practical text imperfections by introducing AttrSeg, a decomposition-aggregation framework that first splits coarse class names into diverse attribute descriptions and then hierarchically aggregates them into a discriminative representation for segmentation. It leverages two decomposition strategies—LLM-generated attributes and manually collected attributes from a novel Fantastic Beasts dataset—to address ambiguity, neologisms, and unnameability, and employs a hierarchical fusion with clustering to align vision and attribute modalities. The method yields state-of-the-art or competitive results across PASCAL-5i, COCO-20i, PASCAL Context, PASCAL VOC, and Fantastic Beasts, along with thorough ablations revealing the value of hierarchical aggregation, cross-modal fusion, and attribute diversity. The practical impact lies in enabling robust OVSS in real-world settings where textual category names are imperfect or novel, supported by new attribute-annotated datasets and extensive analyses.

Abstract

Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but they rely on assumptions that are unrealistic in practical scenarios, where textual category names are often of low quality. For example, this paradigm assumes that new textual categories will be provided accurately and completely, and that they exist in the lexicons used during pre-training. However, exceptions often arise: brief or incomplete names are ambiguous, new words may be absent from the pre-trained lexicons, and some categories are difficult for users to describe. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by how humans understand new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and manual labeling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, forming a discriminative classifier that distinguishes the target object from others. A hierarchical aggregation architecture is further proposed to achieve multi-level aggregation, leveraging a carefully designed clustering module. The final results are obtained by computing the similarity between the aggregated attributes and the image embeddings. To evaluate effectiveness, we annotate three types of datasets with attribute descriptions and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation.
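
The final step described in the abstract, computing the similarity between an aggregated attribute embedding and image embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `similarity_mask`, the use of cosine similarity, the toy 2 x 2 feature map, and the `threshold` value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_mask(pixel_embeddings, class_embedding, threshold=0.5):
    """Return a binary mask: 1 where a pixel embedding is similar to the class embedding.

    pixel_embeddings: H x W grid (nested lists) of D-dimensional vectors.
    class_embedding: D-dimensional vector standing in for the aggregated attributes.
    threshold: hypothetical cut-off chosen for this toy example, not a value from the paper.
    """
    return [
        [1 if cosine(p, class_embedding) > threshold else 0 for p in row]
        for row in pixel_embeddings
    ]


# Toy 2 x 2 feature map with 2-dimensional embeddings (made-up numbers).
pixels = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
aggregated = [1.0, 0.0]  # stand-in for the aggregated attribute representation
mask = similarity_mask(pixels, aggregated)
print(mask)  # [[1, 1], [0, 0]]
```

In this toy case the top row points in nearly the same direction as the aggregated vector and is marked as foreground, while the bottom row is not. With several candidate classes, one would presumably compare each pixel against every class embedding and keep the best match, but that detail is not specified in this summary.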

Paper Structure

This paper contains 42 sections, 10 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Left: Open-vocabulary semantic segmentation (OVSS) assumes that the given new textual categories are accurate, complete, and present in pre-trained lexicons. However, in real-life situations, its practical use is limited by textual ambiguity, neologisms, and unnameability. Middle: We propose a novel attribute decomposition-aggregation framework where vanilla class names are first decomposed into various attribute descriptions (decomposition stage), and the different attribute representations are then aggregated hierarchically into a final class representation for segmentation (aggregation stage). Right: Our framework successfully addresses the aforementioned issues and facilitates more practical applications of OVSS in real-world scenarios.
  • Figure 2: Overview of Attribute Decomposition-Aggregation Framework. (a) The decomposition stage aims to decouple vanilla class names into various attribute descriptions. We design two strategies to build attributes, i.e., using LLMs and manual collection. (b) The aggregation stage aims to merge separated attribute representations into an integrated global description. We propose to hierarchically aggregate attribute tokens into one specific token in $L$ stages. Each stage alternates a fusion module and a clustering module. Masks are generated by calculating the similarity between the aggregated attribute token and the image embeddings.
  • Figure 3: Comparison between Various Aggregation Strategies. The orange / blue colors represent visual / attribute tokens, respectively. Detailed discussions can be found in the section "Aggregation Strategies Discussion".
  • Figure 4: Visualizations of Fantastic Beasts (Part 1/4). Images, predicted segmentation masks, category names, and some corresponding main attributes are presented.
  • Figure 5: Visualizations of Fantastic Beasts (Part 2/4). Images, predicted segmentation masks, category names, and some corresponding main attributes are presented.
  • ...and 2 more figures
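
The aggregation stage outlined in the Figure 2 caption, where attribute tokens are reduced to a single token over $L$ stages that each alternate a fusion module and a clustering module, can be sketched roughly as follows. Everything in this block is an assumption for illustration: `fuse` is a plain dot-product attention stand-in for the paper's fusion module, `cluster` is a soft assignment to the first few tokens used as centers rather than the paper's clustering module, and `group_sizes` is a hypothetical schedule for how many tokens survive each stage.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(attr_tokens, visual_tokens):
    """Stand-in fusion: add to each attribute token an attention-weighted mean of visual tokens."""
    fused = []
    for a in attr_tokens:
        weights = softmax([dot(a, v) for v in visual_tokens])
        context = [
            sum(w * v[d] for w, v in zip(weights, visual_tokens))
            for d in range(len(a))
        ]
        fused.append([x + c for x, c in zip(a, context)])
    return fused


def cluster(tokens, num_groups):
    """Stand-in clustering: softly assign tokens to num_groups centers and average per group.

    The first num_groups tokens serve as centers, which is an arbitrary choice for this sketch.
    """
    centers = tokens[:num_groups]
    # weights[i][g] is how strongly token i belongs to group g.
    weights = [softmax([dot(t, c) for c in centers]) for t in tokens]
    merged = []
    for g in range(num_groups):
        total = sum(w[g] for w in weights)
        merged.append([
            sum(w[g] * t[d] for w, t in zip(weights, tokens)) / total
            for d in range(len(tokens[0]))
        ])
    return merged


def aggregate(attr_tokens, visual_tokens, group_sizes):
    """Run one fusion step and one clustering step per stage; the last group size should be 1."""
    tokens = attr_tokens
    for k in group_sizes:
        tokens = fuse(tokens, visual_tokens)
        tokens = cluster(tokens, k)
    return tokens


# Four made-up attribute tokens and two made-up visual tokens, all 2-dimensional.
attrs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
visuals = [[0.5, 0.5], [1.0, 0.0]]
final = aggregate(attrs, visuals, group_sizes=[2, 1])  # 4 tokens -> 2 -> 1
print(len(final), len(final[0]))
```

The point of the sketch is only the control flow: the token count shrinks stage by stage until one class-level token remains, which can then be compared against image embeddings to produce a mask.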