Understanding Multi-Granularity for Open-Vocabulary Part Segmentation

Jiho Choi, Seonho Lee, Seungho Lee, Minhyun Lee, Hyunjung Shim

TL;DR

Open-vocabulary part segmentation (OVPS) faces generalization gaps, ambiguous boundaries, and underrepresented parts. The paper introduces PartCLIPSeg, which fuses generalized parts with object-level context using FiLM-conditioned CLIP embeddings and a CLIPSeg-based decoder. It adds two attention-control losses (separation and enhancement) and multi-level supervision to improve segmentation across granularities, and it reports gains over prior OVPS methods on Pascal-Part-116, ADE20K-Part-234, and PartImageNet, including on unseen categories. By recognizing object parts beyond a fixed vocabulary, the approach may benefit robotics, medical imaging, and image editing.
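To make the FiLM conditioning mentioned above concrete, here is a minimal NumPy sketch of feature-wise linear modulation, in which a conditioning vector produces a per-channel scale and shift for a set of features. It is an illustration under stated assumptions, not the PartCLIPSeg implementation: the function name `film`, the tensor shapes, the random stand-ins for visual tokens and text embeddings, and the single linear projections for scale and shift are all hypothetical.

```python
import numpy as np


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(condition) * features + beta(condition).

    features:  (num_tokens, feat_dim) array to be modulated.
    condition: (cond_dim,) conditioning vector, e.g. a text embedding.
    w_*, b_*:  hypothetical linear projections from cond_dim to feat_dim.
    """
    gamma = condition @ w_gamma + b_gamma  # per-channel scale, shape (feat_dim,)
    beta = condition @ w_beta + b_beta     # per-channel shift, shape (feat_dim,)
    return features * gamma + beta         # broadcast over the token axis


rng = np.random.default_rng(0)
num_tokens, feat_dim, cond_dim = 4, 8, 16

# Random stand-ins (assumptions): real inputs would come from a CLIP encoder.
visual_tokens = rng.normal(size=(num_tokens, feat_dim))
object_text = rng.normal(size=cond_dim)  # e.g. embedding of an object name
part_text = rng.normal(size=cond_dim)    # e.g. embedding of a generalized part name

w_gamma = rng.normal(size=(cond_dim, feat_dim)) * 0.1
b_gamma = np.ones(feat_dim)
w_beta = rng.normal(size=(cond_dim, feat_dim)) * 0.1
b_beta = np.zeros(feat_dim)

# The same visual features conditioned on two different text queries.
object_features = film(visual_tokens, object_text, w_gamma, b_gamma, w_beta, b_beta)
part_features = film(visual_tokens, part_text, w_gamma, b_gamma, w_beta, b_beta)

# With zero projections, unit scale, and zero shift, FiLM is the identity.
zeros_w = np.zeros((cond_dim, feat_dim))
identity_out = film(visual_tokens, object_text, zeros_w, np.ones(feat_dim),
                    zeros_w, np.zeros(feat_dim))
```

Conditioning the same visual tokens on an object-level query and on a part-level query yields two differently modulated feature sets, which is the general idea behind combining object-level context with generalized parts; how the paper combines them downstream is not shown here.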

Abstract

Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities using diverse and previously unseen vocabularies. Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification. To address these challenges, we propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts. PartCLIPSeg integrates competitive part relationships and attention control, alleviating ambiguous boundaries and underrepresented parts. Experimental results demonstrate that PartCLIPSeg outperforms existing state-of-the-art OVPS methods, offering refined segmentation and an advanced understanding of part relationships within images. Through extensive experiments, our model demonstrated a significant improvement over the state-of-the-art models on the Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets.

Paper Structure (33 sections, 10 equations, 11 figures, 15 tables)

Figures (11)

  • Figure 1: Prediction results of our PartCLIPSeg for unseen categories in the Pascal-Part-116 (Chen et al., 2014; Wei et al., 2024) validation set. The "dog" category is unseen during training. The final prediction of PartCLIPSeg utilizes (b) object-level context and (c) generalized parts, incorporating disjoint activation among parts (e)-(i) and enhancing activation for smaller parts (e.g., (h) "nose").
  • Figure 2: Limitations of existing OVPS methods in predicting unseen categories. (a) Lack of generalization: parts of a "dog" are assigned to other categories such as "cat" and "sheep"; for example, "dog's tail" is misclassified as "sheep's ear" (VLPart; Sun et al., 2023). (b) Ambiguous boundaries: vague boundary output for "aeroplane's body". (c) Missing underrepresented parts: parts such as "beak" and "leg" are neglected (CLIPSeg; Lüddecke and Ecker, 2022; Wei et al., 2024).
  • Figure 3: The overall architecture of PartCLIPSeg. The embeddings derived from the object category name and the part category name are conditioned using the FiLM operation. Each embedding, modified through attention control, is subsequently reconstructed to predict the final object-specific part results.
  • Figure 4: Example of attention control using the separation and enhancement losses. The proposed method manipulates attention maps to accurately identify and segment small parts.
  • Figure 5: Qualitative results of zero-shot part segmentation on Pascal-Part-116 in the Pred-All setting. Annotations for unseen categories (bird, car, dog, sheep, etc.) are not included in the training set.
  • ...and 6 more figures