ViSketch-GPT: Collaborative Multi-Scale Feature Extraction for Sketch Recognition and Generation

Giulio Federico, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, Marco Di Benedetto

TL;DR

The paper addresses the challenge of recognizing and generating human sketches amid high stylistic variability. It introduces ViSketch-GPT, a two-stage, coarse-to-fine approach that uses a quadtree-based multi-scale context to collaboratively refine patch-level details via a Transformer decoder with a VQ-VAE tokenizer, and employs a Signed Distance Field representation to better handle sparse sketch data. The method models $p(S'|c)$ in a reduced-resolution stage and $p(S|S',c)$ in a refinement stage, with a context-aware leaf predictor $p(l_s^{(i)}|\aleph(\bar{l}_{\bar{s}^{(i)}}),c)$ that ensures coherent reconstruction across a hierarchical patch structure. On the QuickDraw dataset, ViSketch-GPT achieves state-of-the-art results in both sketch generation and classification, highlighting the benefits of multi-scale feature collaboration for complex visual patterns and offering a robust framework for geometry-aware sketch understanding with potential impact in AI-assisted creativity and vision tasks.
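
Read together, the two stage-wise distributions named above can be seen as a chain-rule factorization of the joint distribution over the low-resolution sketch $S'$ and the full-resolution sketch $S$ given the class label $c$. This is a restatement under that reading, not a formula quoted from the paper:

$$p(S, S' \mid c) = p(S' \mid c)\, p(S \mid S', c)$$

Under this reading, generation would proceed by first sampling $S'$ from the first factor and then sampling $S$ from the second, conditioned on that $S'$.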

Abstract

Understanding the nature of human sketches is challenging because of the wide variation in how they are created. Recognizing complex structural patterns improves both the accuracy in recognizing sketches and the fidelity of the generated sketches. In this work, we introduce ViSketch-GPT, a novel algorithm designed to address these challenges through a multi-scale context extraction approach. The model captures intricate details at multiple scales and combines them using an ensemble-like mechanism, where the extracted features work collaboratively to enhance the recognition and generation of key details crucial for classification and generation tasks. The effectiveness of ViSketch-GPT is validated through extensive experiments on the QuickDraw dataset. Our model establishes a new benchmark, significantly outperforming existing methods in both classification and generation tasks, with substantial improvements in accuracy and the fidelity of generated sketches. The proposed algorithm offers a robust framework for understanding complex structures by extracting features that collaborate to recognize intricate details, enhancing the understanding of structures like sketches and making it a versatile tool for various applications in computer vision and machine learning.

Paper Structure

This paper contains 10 sections, 21 equations, 10 figures, 2 tables, 2 algorithms.

Figures (10)

  • Figure 1: An example of the task we aim to tackle. Starting from the class label, we aim to generate a sketch belonging to that class.
  • Figure 2: Overview of the two stages: the first stage operates at a very low resolution to simplify and accelerate modeling; the second stage generates plausible details in a scalable manner.
  • Figure 3: First step of the generative refinement pipeline. Given the output of stage 1, $S'$ is resized to the original resolution $\hat{S}$ and the quadtree is computed.
  • Figure 4: The second step of the generative refinement pipeline. Copy the quadtree of $\hat{S}$ into $S$.
  • Figure 5: Process of creating the context of a leaf. Starting from the target leaf, the 3x3 tiles around it are taken with the leaf in the center. The same is done with the parent of the leaf until we reach the root itself. Each tile, regardless of the level, has the same resolution.
  • ...and 5 more figures
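
The context construction described in the Figure 5 caption can be sketched as follows. This is a minimal illustration of the geometry only, not the authors' implementation. The function name `context_windows`, the `(x, y, size)` node encoding, the power-of-two square image, and the choice to include a window around the root are all assumptions made for the example.

```python
def context_windows(x, y, size, image_size):
    """Return 3x3-tile windows around a quadtree leaf and each of its ancestors.

    (x, y) is the top-left corner of the leaf and `size` its side length,
    both in pixels. Each window is (left, top, right, bottom) and covers the
    node plus its eight same-sized neighbours, so it may extend past the
    image border. How out-of-image regions are filled is not specified in
    the caption and is left to the caller (hypothetical choice).
    """
    windows = []
    while True:
        # 3x3 tiles of side `size`, with the current node in the centre.
        windows.append((x - size, y - size, x + 2 * size, y + 2 * size))
        if size >= image_size:
            break  # the root has been reached
        # Move to the parent: side length doubles, origin snaps to its grid.
        size *= 2
        x = (x // size) * size
        y = (y // size) * size
    return windows


# A 4x4 leaf at (8, 4) inside a 16x16 image: leaf, parent, then root.
print(context_windows(8, 4, 4, 16))
# [(4, 0, 16, 12), (0, -8, 24, 16), (-16, -16, 32, 32)]
```

Since the caption states that every tile has the same resolution regardless of level, each window would then be resampled to a common size before being passed to the model. That resampling step is omitted here.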