ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention

Jiawei Wang, Changjian Li

TL;DR

ContextSeg tackles sketch semantic segmentation with a two-stage approach that explicitly encodes stroke structure and context. It introduces a CNN autoencoder-based stroke embedding augmented by a dense distance field $DF^s$, and a segmentation Transformer that labels groups of strokes autoregressively using group codes and context. The embedding network is trained with $L_{em} = L_{recon} + \gamma L_{dis}$, and the segmentation Transformer with a focal loss. The method yields state-of-the-art results on SPG and CreativeSketch and provides practical insights into cross-category training and semantic-aware data augmentation to address data imbalance. Overall, the work demonstrates that combining stroke-aware embeddings with group-based, context-driven decoding significantly improves fine-grained sketch segmentation and offers guidance for robust, cross-domain sketch understanding.
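
The two losses named above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the summary does not say which per-pixel error the reconstruction and distance terms use, so mean squared error is assumed for both, and the names `embedding_loss`, `focal_loss`, `gamma`, and `focusing` are hypothetical. The focal loss is written in its standard binary form.

```python
import numpy as np


def embedding_loss(recon, target, dist_pred, dist_target, gamma=1.0):
    """L_em = L_recon + gamma * L_dis.

    Assumption: both terms are mean squared errors; the source only gives
    the weighted-sum structure, not the individual terms.
    """
    l_recon = np.mean((recon - target) ** 2)        # stroke image reconstruction
    l_dis = np.mean((dist_pred - dist_target) ** 2)  # distance field regression
    return float(l_recon + gamma * l_dis)


def focal_loss(probs, targets, focusing=2.0, eps=1e-7):
    """Standard binary focal loss: -(1 - p_t)^focusing * log(p_t), averaged."""
    probs = np.clip(probs, eps, 1.0 - eps)            # avoid log(0)
    p_t = np.where(targets == 1, probs, 1.0 - probs)  # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** focusing) * np.log(p_t)))
```

With `focusing=0.0` the focal loss reduces to ordinary binary cross-entropy; larger values shrink the contribution of entries that are already classified confidently, which is the usual motivation for using it on imbalanced labels.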

Abstract

Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of pre-defined part labels to individual strokes. This paper presents ContextSeg, a simple yet highly effective two-stage approach to this problem. In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. With group-based labeling, our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets, and extensive evaluation demonstrates its superior performance. Additionally, we offer insights into solving part imbalance in training data and present a preliminary experiment on cross-category training, which can inspire future research in this field.
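
The group-based, auto-regressive labeling described in the second stage can be illustrated with a small control-flow sketch. It shows only the decoding loop, under stated assumptions: `label_by_groups` and `predict_group` are hypothetical names, and `predict_group` is a stand-in for the segmentation Transformer, which is not reproduced here.

```python
from typing import Callable, List, Sequence, Set


def label_by_groups(
    num_strokes: int,
    group_codes: Sequence[int],
    predict_group: Callable[[int, Set[int], Set[int]], Set[int]],
) -> List[int]:
    """Assign one label per stroke by decoding one semantic group per step.

    predict_group(code, labeled, remaining) is a hypothetical stand-in for the
    model: it returns the strokes in `remaining` that belong to group `code`.
    """
    labels = [-1] * num_strokes            # -1 marks a stroke not labeled yet
    labeled: Set[int] = set()
    remaining: Set[int] = set(range(num_strokes))
    for code in group_codes:
        if not remaining:
            break
        # Context for this step: strokes labeled so far plus strokes still open.
        chosen = predict_group(code, labeled, remaining) & remaining
        for stroke in chosen:
            labels[stroke] = code
        labeled |= chosen
        remaining -= chosen
    return labels


# Toy usage: an oracle predictor that reads the answer from a fixed labeling.
truth = [0, 1, 1, 0, 2]
oracle = lambda code, labeled, remaining: {i for i in remaining if truth[i] == code}
result = label_by_groups(len(truth), [0, 1, 2], oracle)
```

Because each step removes the chosen strokes from `remaining`, every later decision sees both what has already been grouped and what is still unassigned, which is the context the abstract refers to.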

Paper Structure (19 sections, 6 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: Given an input sketch, semantic segmentation assigns labels to strokes based on their semantics, forming semantic groups. Our method is robust to stroke variations and achieves superior results (e.g., the correctly labeled airplane windows).
  • Figure 2: Overview of ContextSeg. An input sketch is first divided into a sequence of strokes, which are used to train our stroke embedding network, an autoencoder with an extra distance field output (Sec. \ref{subsec:stroke_embedding}). Then, the learned embeddings are sent to the segmentation Transformer, which operates in an auto-regressive manner (Sec. \ref{subsec:seg_trans}). The Transformer takes contextual information, comprising previously labeled strokes and remaining strokes, as input when labeling strokes at the current step.
  • Figure 3: Stroke distance field. (a) Given an arbitrary point $p$ in the image, we calculate the shortest Euclidean distance from $p$ to the point $t$ on the stroke. (b) The distance curves with three different $k$ values. (c) Distance field maps of three typical $k$ values.
  • Figure 4: Visual comparison with three competitors on the SPG and the CreativeSketch datasets.
  • Figure 5: Sketch reconstruction results of our ablation study on different stroke embedding networks.
  • ...and 9 more figures
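
The distance computation in Figure 3(a) can be sketched as below. This is a rough illustration under assumptions: the stroke is treated as a set of sampled points rather than a continuous curve, the function name `stroke_distance_field` is hypothetical, and the $k$-dependent distance curve from Figure 3(b) is not reproduced because its formula is not given here.

```python
import numpy as np


def stroke_distance_field(stroke_points, height, width):
    """Shortest Euclidean distance from every pixel to the sampled stroke.

    stroke_points: iterable of (x, y) samples along one stroke.
    Returns an array of shape (height, width), indexed as field[y, x].
    """
    pts = np.asarray(stroke_points, dtype=float)          # (N, 2)
    ys, xs = np.mgrid[0:height, 0:width]                  # integer pixel coordinates
    grid = np.stack([xs, ys], axis=-1).astype(float)      # (H, W, 2) as (x, y)
    diff = grid[:, :, None, :] - pts[None, None, :, :]    # (H, W, N, 2)
    dists = np.sqrt((diff ** 2).sum(axis=-1))             # (H, W, N)
    return dists.min(axis=-1)                             # nearest stroke sample


# Toy usage: a two-sample stroke on a 3x3 image.
field = stroke_distance_field([(0, 0), (2, 0)], height=3, width=3)
```

The brute-force broadcast keeps the sketch short; for real image sizes a dedicated distance transform would be the more practical choice.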