Understanding Multi-Granularity for Open-Vocabulary Part Segmentation
Jiho Choi, Seonho Lee, Seungho Lee, Minhyun Lee, Hyunjung Shim
TL;DR
Open-vocabulary part segmentation (OVPS) faces three problems: poor generalization, ambiguous boundaries, and underrepresented parts. The paper introduces PartCLIPSeg, which combines generalized parts with object-level context using FiLM-conditioned CLIP embeddings and a CLIPSeg-based decoder. Two attention-control losses (separation and enhancement) and multi-level supervision yield robust multi-granularity segmentation, with strong improvements on Pascal-Part-116, ADE20K-Part-234, and PartImageNet, including on unseen categories. The approach advances open-vocabulary, fine-grained segmentation, with potential applications in robotics, medical imaging, and image editing, where object parts must be recognized beyond a fixed vocabulary.
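To make the FiLM conditioning mentioned above concrete, the following is a minimal NumPy sketch of generic feature-wise linear modulation (FiLM): a conditioning vector predicts a per-channel scale and shift that are applied to visual features. It is an illustration only, not the PartCLIPSeg implementation. The function name film, the tensor sizes, and the random weights are all assumptions, and the random vector is only a stand-in for a CLIP text embedding of an object or part name.

```python
import numpy as np


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: scale and shift each feature channel using
    parameters predicted linearly from a conditioning vector."""
    gamma = cond @ w_gamma + b_gamma  # per-channel scale, shape (channels,)
    beta = cond @ w_beta + b_beta     # per-channel shift, shape (channels,)
    return features * gamma + beta    # broadcasts over the token axis


rng = np.random.default_rng(0)
num_tokens, channels, embed_dim = 4, 8, 16  # hypothetical sizes

# Stand-ins: real inputs would be visual tokens and a CLIP text embedding.
visual_tokens = rng.normal(size=(num_tokens, channels))
text_embedding = rng.normal(size=(embed_dim,))

# Hypothetical projection weights; in a trained model these are learned.
w_gamma = rng.normal(size=(embed_dim, channels)) * 0.1
b_gamma = np.ones(channels)
w_beta = rng.normal(size=(embed_dim, channels)) * 0.1
b_beta = np.zeros(channels)

modulated = film(visual_tokens, text_embedding, w_gamma, b_gamma, w_beta, b_beta)
print(modulated.shape)
```

With zero projection weights, a scale bias of one, and a shift bias of zero, the function returns its input unchanged, which is a convenient sanity check. How PartCLIPSeg combines the object-conditioned and part-conditioned features is not specified in this summary, so the sketch does not attempt it.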
Abstract
Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities using diverse and previously unseen vocabularies. Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification. To address these challenges, we propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts. PartCLIPSeg integrates competitive part relationships and attention control, alleviating ambiguous boundaries and underrepresented parts. Extensive experiments on the Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that PartCLIPSeg significantly outperforms existing state-of-the-art OVPS methods, offering refined segmentation and an advanced understanding of part relationships within images.
