SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches

Subhadeep Koley, Viswanatha Reddy Gajjala, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song

TL;DR

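SketchYourSeg proposes a mask-free framework that uses a single exemplar freehand sketch as a query to perform subjective segmentation across an entire image gallery. It combines a frozen FG-SBIR backbone with a pretrained foundation model (such as CLIP or DINOv2) and generates pixel-level masks through a differentiable sketch-guided process, without requiring any pixel-level annotation. The training objective combines four losses, $L_{total} = L_{InfoNCE} + L_{SBIR} + L_{unpaired} + L_{reg}$, and supports multi-granular segmentation at the category, fine-grained, and part levels, with part-level segmentation enabled by a sketch-partitioning augmentation. Experiments on Sketchy and Sketchy-Extended show significant improvements on both seen and unseen classes, supporting the claim that this new human-in-the-loop segmentation paradigm offers a good trade-off between precision and efficiency.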

Abstract

We introduce SketchYourSeg, a novel framework that establishes freehand sketches as a powerful query modality for subjective image segmentation across entire galleries through a single exemplar sketch. Unlike text prompts that struggle with spatial specificity or interactive methods confined to single-image operations, sketches naturally combine semantic intent with structural precision. This unique dual encoding enables precise visual disambiguation for segmentation tasks where text descriptions would be cumbersome or ambiguous -- such as distinguishing between visually similar instances, specifying exact part boundaries, or indicating spatial relationships in composed concepts. Our approach addresses three fundamental challenges: (i) eliminating the need for pixel-perfect annotation masks during training with a mask-free framework; (ii) creating a synergistic relationship between sketch-based image retrieval (SBIR) models and foundation models (CLIP/DINOv2) where the former provides training signals while the latter generates masks; and (iii) enabling multi-granular segmentation capabilities through purpose-made sketch augmentation strategies. Our extensive evaluations demonstrate superior performance over existing approaches across diverse benchmarks, establishing a new paradigm for user-guided image segmentation that balances precision with efficiency.

Paper Structure

This paper contains 15 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Examples of fine-grained (left) and category-level (right) image pairs from the Sketchy dataset [sangkloy2016the].
  • Figure 2: Fine-grained (left) and part-level (right) segmentation.
  • Figure 3: We generate the sketch-guided correlation map $\mathcal{C}$ by multiplying the global sketch feature $\mathbf{s}$ with the reshaped patch-wise photo feature $\mathbf{p}_s$. Bilinear upscaling followed by differentiable sigmoid thresholding yields the patch-wise correlation mask $\mathcal{M}$, which, when multiplied with $\mathcal{P}$, gives the masked candidate photo $\mathcal{P}_{\mathrm{mask}}$ (highlighted in grey). Through $\mathcal{L}_{\mathrm{SBIR}}$, the frozen SBIR backbone $\mathcal{F}$ constrains the input sketches to generate masks that segment only the foreground object (queried via $\mathcal{S}$). $\mathcal{L}_{\mathrm{InfoNCE}}$ between $\mathbf{s}$ and $\mathbf{p}_s$ enhances sketch-photo alignment. At every batch, $\mathcal{L}_{\mathrm{unpaired}}$ ensures an all-zero mask for negative samples ($\mathcal{N}$). To avoid collapsing to a trivial all-one mask, we additionally regularise $\mathcal{M}$ via $\mathcal{L}_{\mathrm{reg}}$.
  • Figure 4: Example of our sketch-partitioning augmentation. $\mathcal{S}$ is divided into $\mathcal{S}_A$ and $\mathcal{S}_B$ along straight lines drawn from the centroid (blue dot). The common foreground region of $\mathcal{M}$ is bordered in red.
  • Figure 5: Qualitative comparison on Sketchy-Extended [liu2017deep] for category-level segmentation on seen (left) and unseen (right) classes.
  • ...and 4 more figures
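
The mask-generation path described in the Figure 3 caption (correlation map, bilinear upscaling, sigmoid thresholding, masking) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names, the threshold `tau`, the `temperature`, the align-corners resizing convention, and the single-channel photo are all assumptions made for illustration, and the encoders that would produce $\mathbf{s}$ and $\mathbf{p}_s$ are replaced by toy vectors.

```python
import math

def correlation_map(s, patches, h, w):
    """Dot the global sketch feature s with each of the h*w patch features."""
    assert len(patches) == h * w
    flat = [sum(a * b for a, b in zip(s, p)) for p in patches]
    return [flat[r * w:(r + 1) * w] for r in range(h)]

def bilinear_upscale(grid, out_h, out_w):
    """Bilinearly resize a 2-D grid (align-corners convention, an assumption)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def soft_threshold(grid, tau=0.5, temperature=0.1):
    """Smooth stand-in for a hard threshold: sigmoid((c - tau) / temperature).
    tau and temperature are hypothetical values, not taken from the paper."""
    return [[1.0 / (1.0 + math.exp(-(c - tau) / temperature)) for c in row]
            for row in grid]

def apply_mask(photo, mask):
    """Element-wise product of a single-channel photo and the soft mask."""
    return [[px * m for px, m in zip(prow, mrow)]
            for prow, mrow in zip(photo, mask)]

# Toy inputs: a 2-D sketch feature and a 2x2 grid of patch features.
s = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
C = correlation_map(s, patches, h=2, w=2)      # [[1.0, 0.0], [0.0, 1.0]]
M = soft_threshold(bilinear_upscale(C, 4, 4))  # 4x4 soft mask in (0, 1)
photo = [[1.0] * 4 for _ in range(4)]
P_mask = apply_mask(photo, M)
```

Because the sigmoid is smooth, gradients can flow from a loss computed on the masked photo back to the features that produced the correlation map, which is the property the caption relies on when it calls the thresholding differentiable.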