Open Vocabulary Semantic Scene Sketch Understanding

Ahmed Bourouis, Judith Ellen Fan, Yulia Gryaditskaya

TL;DR

This work tackles open-vocabulary semantic understanding of scene sketches. It builds on a CLIP-pretrained Vision Transformer encoder and fine-tunes it with visual prompts in a two-level hierarchical framework: the first level encodes the scene sketch holistically, and the second disentangles individual categories. The key modifications are value-value (v-v) self-attention and category-aware cross-attention between the textual and visual branches, which enable text-guided segmentation of freehand sketches without pixel-level annotations. On FS-COCO freehand sketches the model reaches 85.5% pixel accuracy, 37 points above zero-shot CLIP, and it is reported to outperform language-supervised baselines and to generalize to unseen categories and external datasets. A human–model alignment analysis identifies remaining weaknesses in handling ambiguity and highly interconnected categories, and supplementary analyses support the generalization and robustness claims.

Abstract

We study the underexplored but fundamental vision problem of machine understanding of abstract freehand scene sketches. We introduce a sketch encoder that yields a semantically aware feature space, which we evaluate on a semantic sketch segmentation task. To train our model, we rely only on bitmap sketches with brief captions and do not require any pixel-level annotations. To generalize to a large set of sketches and categories, we build on a vision transformer encoder pretrained with the CLIP model. We freeze the text encoder and perform visual-prompt tuning of the visual encoder branch while introducing a set of critical modifications. First, we augment the classical key-query (k-q) self-attention blocks with value-value (v-v) self-attention blocks. Second, central to our model is a two-level hierarchical network design that enables efficient semantic disentanglement: the first level ensures holistic scene sketch encoding, and the second level focuses on individual categories. Third, in the second level of the hierarchy, we introduce cross-attention between the textual and visual branches. Our method improves segmentation pixel accuracy over zero-shot CLIP by 37 points, reaching an accuracy of $85.5\%$ on the FS-COCO sketch dataset. Finally, we conduct a user study that allows us to identify the further improvements our method needs to reconcile machine and human understanding of scene sketches.
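
As a rough illustration of the difference between key-query and value-value self-attention named in the abstract, the sketch below computes both for a handful of token vectors. It is not the authors' implementation: the paper's exact v-v equation is not reproduced in this section, so the form softmax(V V^T / sqrt(d)) V is an assumption, as are the single head, the absence of learned projections, and the hypothetical function names `attention`, `qk_attention`, and `vv_attention`.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(a, b, values):
    """Return softmax(a b^T / sqrt(d)) values for lists of token vectors.

    `a` and `b` produce the similarity scores; `values` are the vectors
    that get mixed. The 1/sqrt(d) scaling is an assumption of this sketch.
    """
    scale = 1.0 / math.sqrt(len(a[0]))
    width = len(values[0])
    out = []
    for a_i in a:
        scores = [scale * sum(x * y for x, y in zip(a_i, b_j)) for b_j in b]
        weights = softmax(scores)
        out.append(
            [sum(w * v[k] for w, v in zip(weights, values)) for k in range(width)]
        )
    return out


def qk_attention(q, k, v):
    """Classical self-attention: scores come from queries and keys."""
    return attention(q, k, v)


def vv_attention(v):
    """Value-value self-attention (assumed form): scores come from the values."""
    return attention(v, v, v)


if __name__ == "__main__":
    # Three toy "patch tokens": two alike, one different.
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for row in vv_attention(tokens):
        print(row)
```

In this assumed form, each token's weights depend only on how similar its value vector is to the other value vectors, so the two identical toy tokens receive identical outputs that stay closer to their own direction than the third token's output does.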

Paper Structure

This paper contains 54 sections, 5 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: Comparison of the segmentation result obtained with CLIP visual encoder features and features from our model.
  • Figure 2: Our framework consists of two levels: I. Holistic Scene Sketch Understanding and II. Targeting individual categories disentanglement. Please refer to the method section for details.
  • Figure 3: Comparison of similarity maps obtained with classical attention computation (q-k attention), shown in the second row, with those obtained from v-v attention, given by the paper's v-v attention equation.
  • Figure 4: Visualization of disentanglement over epochs.
  • Figure 5: Visual comparison of our method with CLIP Surgery$^{\star\star}$. CLIP Surgery$^{\star\star}$ represents the fine-tuned ViT from the CLIP model with v-v self-attention introduced at both training and inference stages. The numbers show Acc@P values.
  • ...and 10 more figures