Do Generalised Classifiers really work on Human Drawn Sketches?

Hmrishav Bandyopadhyay, Pinaki Nath Chowdhury, Aneeshan Sain, Subhadeep Koley, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song

TL;DR

This work addresses generalised sketch recognition along two challenging axes: open-set category generalisation and generalisation across sketch abstraction levels. It adapts the CLIP foundation model to sketches by learning sketch-specific vision and text prompts, introducing a raster-to-vector auxiliary objective, and using a codebook-based abstraction module whose entries are mixed to cover a continuous abstraction spectrum. The approach yields strong few-shot and zero-shot performance across Edgemaps, TU-Berlin, and QuickDraw, and is supported by ablations and abstraction-readout analyses, including evaluations on unseen abstractions such as CLIPasso drawings. Collectively, SketchCLIP demonstrates that a foundation-model-driven, abstraction-aware prompting framework can serve as a robust backbone for open-set, cross-abstraction sketch recognition.

Abstract

This paper, for the first time, marries large foundation models with human sketch understanding. We demonstrate what this brings -- a paradigm shift in terms of generalised sketch representation learning (e.g., classification). This generalisation happens on two fronts: (i) generalisation across unknown categories (i.e., open-set), and (ii) generalisation traversing abstraction levels (i.e., good and bad sketches), both being timely challenges that remain unsolved in the sketch literature. Our design is intuitive and centred around transferring the already stellar generalisation ability of CLIP to benefit generalised learning for sketches. We first "condition" the vanilla CLIP model by learning sketch-specific prompts using a novel auxiliary head for raster-to-vector sketch conversion. This importantly makes CLIP "sketch-aware". We then make CLIP sensitive to the inherently different sketch abstraction levels. This is achieved by learning a codebook of abstraction-specific prompt biases, a weighted combination of which facilitates the representation of sketches across abstraction levels -- low abstract edge-maps, medium abstract sketches in TU-Berlin, and highly abstract doodles in QuickDraw. Our framework surpasses popular sketch representation learning algorithms in both zero-shot and few-shot setups and in novel settings across different abstraction boundaries.
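
As a rough illustration of the "weighted combination" of codebook entries mentioned in the abstract, the Python sketch below mixes three prompt-bias vectors using softmax weights. The names `codebook` and `mix_prompt_bias`, the 4-dimensional vectors, and the use of a plain softmax over abstraction logits are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompt_bias(abstraction_logits, codebook):
    """Return (weights, mixed_bias).

    weights    : softmax of the abstraction logits, one per codebook entry
    mixed_bias : weighted sum of the codebook entries, same length as one entry
    """
    weights = softmax(abstraction_logits)
    dim = len(codebook[0])
    mixed = [
        sum(w * entry[d] for w, entry in zip(weights, codebook))
        for d in range(dim)
    ]
    return weights, mixed


# Hypothetical prompt biases for low, medium, and high abstraction.
codebook = [
    [1.0, 0.0, 0.0, 0.0],  # low abstraction (e.g. edge-maps)
    [0.0, 1.0, 0.0, 0.0],  # medium abstraction (e.g. TU-Berlin)
    [0.0, 0.0, 1.0, 0.0],  # high abstraction (e.g. QuickDraw)
]

# Hypothetical logits leaning towards the low-abstraction entry.
weights, mixed = mix_prompt_bias([2.0, 1.0, 0.0], codebook)
```

Because the weights are continuous, a sketch that falls between two abstraction levels gets a bias that blends the corresponding entries rather than being forced into a single level.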

Paper Structure (14 sections, 10 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Unlike photo classification, sketch classification poses additional challenges such as abstraction -- humans draw differently based on their subjective interpretations, sketching ability, and drawing time. Existing datasets such as TU-Berlin [berlin] and QuickDraw [quickdraw] only capture the time axis and collect human sketches drawn under $280$ and $20$ seconds, respectively. Following [berlin, quickdraw, hertzmann2020line, vinker2022clipasso], we consider Edgemaps (EM) as low abstract, TU-Berlin (TU) sketches as medium abstract, and QuickDraw (QD) ones as highly abstract (left). Naively training CLIP via prompt learning [zhou2022learning] on sketches (right) from one abstraction level (QD, TU, or EM) individually does not generalise across varying abstractions (QD + TU + EM). Jointly training on multiple abstractions (QD + TU + EM) is also sub-optimal ($45.6$ on CoOp [zhou2022learning] vs $62.9$ on Ours). (middle) Importantly, our proposed method predicts a classification score and an abstraction level for input sketches. Plotting classification accuracy vs predicted abstraction level reveals a scope for improvement (shaded region), showing that, despite our significant improvement in classification ($\uparrow$ 17.4%) over naive CLIP + prompt learning, generalisation across varying sketch abstractions is still an open problem. We hope this will motivate future works to democratise existing methods [clip, zhou2022conditional] for human drawn sketches.
  • Figure 2: Number of sketches vs class membership for $600$ sketch instances, with membership defined by class labels ($\hat{\mathbb{A}}_{l}, \hat{\mathbb{A}}_{m}, \hat{\mathbb{A}}_{h}$) and by softmax-normalised distributions ($\mathbb{A}_{l}, \mathbb{A}_{m}, \mathbb{A}_{h}$). Sketches are taken from $20$ unseen categories common across QD, TU, and EM with $10$ sketches per category. Despite the expected peaks, a significant number of sketches lie in the continuous spectrum between $\hat{\mathbb{A}}_{l} \to \hat{\mathbb{A}}_{m}$ and $\hat{\mathbb{A}}_{m} \to \hat{\mathbb{A}}_{h}$ (overlaps).
  • Figure 3: Given an input sketch, we compute visual feature $f_{s}$ with sketch prompts using a CLIP image encoder. Next, $f_{s}$ is fed into 4 pipelines: (i) An auxiliary raster-to-vector (sketch2vec) translation module that distils sketch-specific traits. (ii) A Meta-Net to predict an instance-specific context $\pi$ to generalise on unseen sketches. (iii) A codebook classifier $\mathcal{C}_{\theta}$ to
  • Figure 4: Human study to rank $3$ sketches from the same category and the same dataset into low, medium, or high abstraction levels.
  • Figure 5: User opinion on abstraction rankings
  • ...and 2 more figures