Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation

Aleyna Kütük; Tevfik Metin Sezgin

TL;DR

This work addresses scene sketch semantic segmentation by introducing the Class-Agnostic Visio-Temporal Network (CAVT), which leverages a class-agnostic, visio-temporal detector and a stroke-order–aware post-processing module to achieve stroke-level instance segmentation in scene sketches. A key novelty is performing segmentation at both the instance and stroke levels while remaining independent of predefined object categories, enabled by an RGB coloring technique that preserves temporal stroke order. To support this line of research, the FrISS dataset is introduced as the largest free-hand, vector-format scene sketch collection with dense instance- and stroke-level annotations, plus text and audio annotations, facilitating robust training and evaluation. Experimental results on FrISS and CBSC show state-of-the-art performance over prior scene sketch segmentation models, with ablations confirming the contributions of temporal ordering, class-agnostic training, and post-processing. The work thereby enables more coherent, instance-aware, stroke-preserving segmentation in hand-drawn scene sketches and provides resources for broader cross-modal and stroke-based studies.
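The idea of encoding stroke order as color can be illustrated with a minimal sketch. It assumes a simple linear blend from blue (first stroke) to red (last stroke). The function name `stroke_order_colors` and the linear interpolation are illustrative assumptions, not the paper's exact coloring scheme.

```python
def stroke_order_colors(num_strokes):
    """Return one (R, G, B) tuple per stroke, in drawing order.

    Assumption: a linear blue-to-red blend; the actual spectrum used
    by the authors may differ.
    """
    if num_strokes <= 0:
        return []
    if num_strokes == 1:
        return [(0, 0, 255)]  # a single stroke is drawn in pure blue
    colors = []
    for i in range(num_strokes):
        t = i / (num_strokes - 1)  # 0.0 for the first stroke, 1.0 for the last
        red = round(255 * t)
        blue = round(255 * (1 - t))
        colors.append((red, 0, blue))
    return colors


if __name__ == "__main__":
    for index, color in enumerate(stroke_order_colors(5)):
        print(index, color)
```

Rendering each stroke with its color lets a detector that only sees an image still receive a cue about when each stroke was drawn.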

Abstract

Scene sketch semantic segmentation is a crucial task for various applications including sketch-to-image retrieval and scene understanding. Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes due to the shift from vector to image format. Moreover, these methods struggle to segment objects from categories absent in the training data. In this paper, we propose a Class-Agnostic Visio-Temporal Network (CAVT) for scene sketch semantic segmentation. CAVT employs a class-agnostic object detector to detect individual objects in a scene and groups the strokes of instances through its post-processing module. This is the first approach that performs segmentation at both the instance and stroke levels within scene sketches. Furthermore, there is a lack of free-hand scene sketch datasets with both instance and stroke-level class annotations. To fill this gap, we collected the largest Free-hand Instance- and Stroke-level Scene Sketch Dataset (FrISS) that contains 1K scene sketches and covers 403 object classes with dense annotations. Extensive experiments on FrISS and other datasets demonstrate the superior performance of our method over state-of-the-art scene sketch segmentation models. The code and dataset will be made public after acceptance.
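To make the step from detected boxes to stroke groups concrete, the following is a rough sketch of one way strokes could be assigned to instances. It is not the paper's post-processing module: the containment rule, the `min_overlap` threshold, and the names `fraction_inside` and `assign_strokes` are all hypothetical, and stroke order is not used here.

```python
def fraction_inside(stroke, box):
    """Fraction of a stroke's (x, y) points that fall inside box (x0, y0, x1, y1)."""
    if not stroke:
        return 0.0
    x0, y0, x1, y1 = box
    inside = sum(1 for x, y in stroke if x0 <= x <= x1 and y0 <= y <= y1)
    return inside / len(stroke)


def assign_strokes(strokes, boxes, min_overlap=0.5):
    """Assign each stroke to the box containing most of its points.

    Returns one box index per stroke, or -1 when no box covers at least
    min_overlap of the stroke (hypothetical threshold).
    """
    labels = []
    for stroke in strokes:
        best_idx, best_frac = -1, 0.0
        for idx, box in enumerate(boxes):
            frac = fraction_inside(stroke, box)
            if frac > best_frac:
                best_idx, best_frac = idx, frac
        labels.append(best_idx if best_frac >= min_overlap else -1)
    return labels


if __name__ == "__main__":
    strokes = [[(1, 1), (2, 2)], [(10, 10), (11, 11)], [(50, 50)]]
    boxes = [(0, 0, 5, 5), (9, 9, 12, 12)]
    print(assign_strokes(strokes, boxes))
```

Because the boxes carry no class label, a grouping of this kind works the same way for object categories that never appeared in training.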

Paper Structure (34 sections, 14 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Sketch Semantic Segmentation
Sketch Datasets
Methodology
Class-Agnostic Visio-Temporal Detector
Post-Processing Module
Synthetic Dataset Preparation for Training
The FrISS Dataset
Sketch Collection
Sketch Annotation
Statistics and Analysis
Experiments
Datasets
Sketch Classification
...and 19 more sections

Figures (14)

Figure 1: Sample scene sketches from the FrISS dataset, each paired with its textual scene description. For each pair, the left image shows the black-and-white sketch, while the right image highlights the instance- and stroke-level class annotations.
Figure 2: The overall pipeline of CAVT
Figure 3: Sample scenes from FrISS drawn by three individuals from the same textual scene description.
Figure 4: Visual comparison of our method with the LDP [ge2022exploring] and OV [bourouis2023open] models, evaluated on the FrISS-SS dataset.
Figure S1: Sample scene sketch from CBSC, showing the input to our object detector. Each stroke in the scene is color-coded by drawing order, using a spectrum from blue to red, as illustrated at the bottom.
...and 9 more figures
