One-shot In-context Part Segmentation

Zhenqi Dai, Ting Liu, Xingxing Zhang, Yunchao Wei, Yanning Zhang

TL;DR

This paper tackles part segmentation under minimal supervision with OIParts, a training-free framework that combines two visual foundation models (VFMs): DINOv2 for dense local descriptors and Stable Diffusion for global structure. Adaptive channel selection, which minimizes intra-class distance, yields discriminative, part-specific representations from a single in-context example and enables pixel-wise segmentation of novel objects under varied appearance, pose, and occlusion. Experiments on PASCAL-Part and CelebAMask-HQ show that OIParts outperforms existing one-shot methods and rivals some ten-shot or supervised baselines, with notable gains under pose variation and partial visibility. Because the approach needs no training and only one annotated in-context example, it preserves generalization while remaining data-efficient and flexible, relying on a simple fusion of VFM features and a principled channel selection strategy. Overall, OIParts suggests that training-free in-context methods are a practical option for fine-grained part segmentation across diverse domains.

Abstract

In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs encounter difficulties when faced with scenarios where the one-shot image and test image exhibit significant variance in appearance and perspective, or when the object in the test image is partially visible. We argue that training on the one-shot example often leads to overfitting, thereby compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach by minimizing the intra-class distance for better exploiting these two features, thereby enhancing the discriminatory power of the extracted features for the fine-grained parts. We have achieved remarkable segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experimentation on three benchmark datasets, we have demonstrated the superiority of our proposed method over existing part segmentation approaches in one-shot settings.
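
To make the idea of "channel selection by minimizing the intra-class distance" concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-pixel features are already extracted and scaled comparably across channels, measures intra-class distance as the squared deviation of each pixel from its part mean, and keeps a fixed number of channels. The function name `select_channels`, the `keep` parameter, and the toy data are hypothetical.

```python
import numpy as np


def select_channels(features, labels, keep):
    """Return indices of the `keep` channels with the smallest intra-class distance.

    features: (N, C) array, one C-dimensional descriptor per pixel.
    labels:   (N,) integer part label per pixel (from the in-context mask).
    Assumption: intra-class distance is the mean squared deviation from the
    part mean, computed independently for every channel.
    """
    intra = np.zeros(features.shape[1])
    for part in np.unique(labels):
        members = features[labels == part]
        # Squared distance of each pixel to its part mean, summed per channel.
        intra += ((members - members.mean(axis=0)) ** 2).sum(axis=0)
    intra /= len(labels)
    # Stable sort so that ties resolve to the lower channel index.
    return np.argsort(intra, kind="stable")[:keep]


# Toy data: channel 0 is constant within each part, channels 1-2 are noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
clean = labels[:, None].astype(float)
noisy = rng.normal(size=(100, 2))
features = np.hstack([clean, noisy])

selected = select_channels(features, labels, keep=1)
print(selected)  # -> [0]
```

Ranking by raw intra-class distance favours low-variance channels, so a practical version would likely normalize each channel first; that step is omitted here for brevity.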

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Part segmentation results in various scenarios. Each in-context example is displayed on the left, with the part segmentation results generated by our OIParts highlighted in the dotted boxes.
  • Figure 2: The overall framework of our proposed OIParts. We acquire features for each image by extracting them from DINOv2 and SD. The in-context mask is first converted into a set of binary masks, one per class. After applying channel selection to the features of both the in-context and query images, we compute a similarity score map between each pixel in the query image and all pixels in the in-context image. This similarity score map is then used to aggregate the in-context image’s binary masks to generate the corresponding label.
  • Figure 3: The overall pipeline of the proposed channel selection.
  • Figure 4: Qualitative comparison with other methods. The three examples are from face, car, and horse datasets respectively. Existing methods exhibit three main issues: segmentation results not aligned with the query image, challenges in handling perspective differences, and difficulty with partially visible objects.
  • Figure 5: Comparison between our proposed method and SLiMe across various in-context examples. Evaluations were performed on 20 randomly selected query images using 8 distinct randomly selected in-context examples.
  • ...and 1 more figure
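
The label-generation steps described in the Figure 2 caption (per-class binary masks, a query-to-context similarity map, and similarity-weighted aggregation of the masks) can be sketched as follows. This is an illustrative reading under stated assumptions, not the paper's code: cosine similarity, a softmax with a `temperature` parameter, and the function name `propagate_labels` are all assumptions, and feature extraction and channel selection are taken as already done.

```python
import numpy as np


def propagate_labels(ctx_feats, ctx_mask, qry_feats, num_classes, temperature=0.1):
    """Assign a part label to every query pixel from one annotated example.

    ctx_feats: (N, C) features of the in-context image pixels.
    ctx_mask:  (N,) integer class label per in-context pixel.
    qry_feats: (M, C) features of the query image pixels.
    Returns an (M,) array of predicted labels.
    """
    # Convert the in-context mask into one binary mask per class: (N, K).
    binary_masks = np.eye(num_classes)[ctx_mask]

    # Assumed similarity measure: cosine similarity between pixel features.
    ctx = ctx_feats / (np.linalg.norm(ctx_feats, axis=1, keepdims=True) + 1e-8)
    qry = qry_feats / (np.linalg.norm(qry_feats, axis=1, keepdims=True) + 1e-8)
    sim = qry @ ctx.T  # (M, N) similarity score map

    # Assumed weighting: softmax over in-context pixels for each query pixel.
    weights = np.exp((sim - sim.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)

    # Aggregate the binary masks with the weights, then pick the best class.
    scores = weights @ binary_masks  # (M, K)
    return scores.argmax(axis=1)


# Toy example: two in-context pixels, one per class, and two query pixels.
ctx_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
ctx_mask = np.array([0, 1])
qry_feats = np.array([[0.9, 0.1], [0.2, 0.8]])

predicted = propagate_labels(ctx_feats, ctx_mask, qry_feats, num_classes=2)
print(predicted)  # -> [0 1]
```

Because nothing is fitted to the in-context example, swapping in a different example only changes the inputs, which matches the training-free setting the paper describes.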