Co-Segmentation without any Pixel-level Supervision with Application to Large-Scale Sketch Classification

Nikolaos-Antonios Ypsilantis, Ondřej Chum

TL;DR

This work proposes a novel method for object co-segmentation, i.e. pixel-level localization of a common object in a set of images, that uses no pixel-level supervision for training and shows that sketch recognition significantly benefits when the classifier is trained on sketch-like structures extracted from the co-segmented area rather than from the full image.

Abstract

This work proposes a novel method for object co-segmentation, i.e. pixel-level localization of a common object in a set of images, that uses no pixel-level supervision for training. Two pre-trained Vision Transformer (ViT) models are exploited: an ImageNet classification-trained ViT, whose features are used to estimate rough object localization through intra-class token relevance, and a self-supervised DINO-ViT for intra-image token relevance. On recent challenging benchmarks, the method achieves state-of-the-art performance among methods trained with the same level of supervision (image labels) while being competitive with methods trained with pixel-level supervision (binary masks). The benefits of the proposed co-segmentation method are further demonstrated in the task of large-scale sketch recognition, that is, the classification of sketches into a wide range of categories. The limited amount of hand-drawn sketch training data is leveraged by exploiting readily available image-level-annotated datasets of natural images containing a large number of classes. To bridge the domain gap, the classifier is trained on a sketch-like proxy domain derived from edges detected on natural images. We show that sketch recognition significantly benefits when the classifier is trained on sketch-like structures extracted from the co-segmented area rather than from the full image. Code: https://github.com/nikosips/CBNC
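To make the idea of token relevance across images of one class more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that patch tokens have already been extracted by some ViT and scores each token of a target image by its best cosine match in every other image of the same class, averaged over those images. The function name `class_relevance` and the max-then-mean aggregation are illustrative assumptions; the abstract does not specify how relevance is computed.

```python
import numpy as np

def class_relevance(tokens, other_tokens):
    """Hypothetical patch-level relevance score (an assumption, not the paper's formula).

    tokens:       (N, D) array of patch tokens from the target image.
    other_tokens: list of (M_i, D) arrays, one per other image of the same class.
    Returns an (N,) array: for each target token, the mean over the other
    images of its maximum cosine similarity to any token of that image.
    """
    def l2_normalize(x):
        # Small epsilon guards against division by zero for all-zero tokens.
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    t = l2_normalize(np.asarray(tokens, dtype=float))
    per_image_scores = []
    for other in other_tokens:
        o = l2_normalize(np.asarray(other, dtype=float))
        sim = t @ o.T                      # (N, M_i) cosine similarities
        per_image_scores.append(sim.max(axis=1))
    return np.mean(per_image_scores, axis=0)

# Toy usage: token 0 recurs in the other images, token 1 does not.
target = np.array([[1.0, 0.0], [0.0, 1.0]])
others = [np.array([[1.0, 0.0]]), np.array([[2.0, 0.0]])]
relevance = class_relevance(target, others)
```

In this toy case the first token receives a high score and the second a low one, which matches the intuition that patches recurring across images of a class are likely to belong to the common object.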

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: (a) Original RGB image of the class "guitar"; (b) Patch-level class relevance based on inter-image ImageNet ViT token similarity (low values in blue, high values in red); (c) Patch-level segmentation based on DINO-ViT intra-image token similarity with bias from the class relevance, followed by refinement via GrabCut; (d) Training example for sketch recognition extracted from the whole image, as in etc22; (e) Training example extracted from the object mask; (f) Test-time sketch fed to the sketch classifier. We show that training with examples like (e) instead of (d) improves the performance of sketch classifiers trained without sketches.
  • Figure 2: A diagram presenting the pipeline of the proposed co-segmentation method. This pipeline is also used as a preprocessing step for the training set of the sketch recognition task.
  • Figure 3: Comparison of class relevance heatmaps produced by ImageNet-ViT (left) vs. DINO-ViT (right) for two different classes. It is observed that ImageNet features provide more discriminative class relevance.
  • Figure 5: (a) Image of the class "hat". (b) Segmentation obtained by the N-cut method. (c) Class relevance heatmap, values of 0 are in deep blue, positive values range from blue (low) to red (high). (d) Segmentation obtained by the proposed method, using the class relevance bias. Without the class information, the N-cut segmentation fails to focus on the class object.
  • Figure 6: Qualitative comparison of the biased N-cut segmentation using affinity from DINO-ViT features (left) and ImageNet-ViT features (middle). Both use the class-relevance bias estimated by ImageNet-ViT features (right).
  • ...and 4 more figures
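
The difference between training examples (d) and (e) in Figure 1 can be illustrated with a small sketch: edges are computed on the whole image and then restricted to the object mask. This is an illustration under stated assumptions, not the pipeline from the paper. The source does not name the edge detector, so a thresholded finite-difference gradient magnitude stands in for it here, and the names `edge_map`, `masked_sketch`, and `thresh` are hypothetical.

```python
import numpy as np

def edge_map(gray, thresh=0.2):
    """Stand-in edge detector (assumption): thresholded gradient magnitude."""
    gy, gx = np.gradient(np.asarray(gray, dtype=float))
    return np.hypot(gx, gy) > thresh

def masked_sketch(gray, mask, thresh=0.2):
    """Keep only the edges that fall inside the (co-segmented) object mask."""
    return edge_map(gray, thresh) & np.asarray(mask, dtype=bool)

# Toy usage: a vertical step edge across the whole image,
# with an object mask covering only the top three rows.
gray = np.zeros((6, 6))
gray[:, 3:] = 1.0
mask = np.zeros((6, 6), dtype=bool)
mask[:3, :] = True

full_image_edges = edge_map(gray)         # analogous to Figure 1 (d)
object_edges = masked_sketch(gray, mask)  # analogous to Figure 1 (e)
```

Restricting the edge map to the mask discards background structure, which is the property the paper reports as beneficial for sketch classifiers trained without sketches.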