ViSketch-GPT: Collaborative Multi-Scale Feature Extraction for Sketch Recognition and Generation
Giulio Federico, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, Marco Di Benedetto
TL;DR
The paper addresses the challenge of recognizing and generating human sketches amid high stylistic variability. It introduces ViSketch-GPT, a two-stage, coarse-to-fine approach that uses a quadtree-based multi-scale context to collaboratively refine patch-level details via a Transformer decoder with a VQ-VAE tokenizer, and employs a Signed Distance Field representation to better handle sparse sketch data. The method models $p(S' \mid c)$ in a reduced-resolution stage and $p(S \mid S', c)$ in a refinement stage, with a context-aware leaf predictor $p(l_s^{(i)} \mid \aleph(\bar{l}_{\bar{s}^{(i)}}), c)$ that ensures coherent reconstruction across a hierarchical patch structure. On the QuickDraw dataset, ViSketch-GPT achieves state-of-the-art results in both sketch generation and classification, highlighting the benefits of multi-scale feature collaboration for complex visual patterns and offering a robust framework for geometry-aware sketch understanding with potential impact in AI-assisted creativity and vision tasks.
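To make the quadtree-based patch hierarchy more concrete, the following is a minimal sketch of a generic region quadtree over a binary sketch image. It is not the authors' implementation: the function name `quadtree_leaves`, the splitting rule (subdivide only patches that contain stroke pixels) and the `min_size` parameter are assumptions chosen for illustration, since the summary does not state the actual splitting criterion.

```python
def quadtree_leaves(img, x=0, y=0, size=None, min_size=2):
    """Return leaf patches (x, y, size) of a region quadtree.

    Assumptions (illustrative, not from the paper): `img` is a square
    list-of-lists of 0/1 values whose side is a power of two, and a patch
    is subdivided only while it contains at least one stroke pixel.
    """
    if size is None:
        size = len(img)
    # A patch "has ink" if any pixel inside it is non-zero.
    has_ink = any(
        img[r][c]
        for r in range(y, y + size)
        for c in range(x, x + size)
    )
    # Empty patches and patches at the minimum size become leaves.
    if not has_ink or size <= min_size:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    # Recurse into the four quadrants, row by row.
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x + dx, y + dy, half, min_size)
    return leaves


# Toy 8x8 sketch with a short diagonal stroke in the top-left corner.
img = [[0] * 8 for _ in range(8)]
for i in range(3):
    img[i][i] = 1

leaves = quadtree_leaves(img)
for leaf in leaves:
    print(leaf)
```

In this toy case the empty quadrants stay as large 4x4 leaves, while the quadrant containing the stroke is split into 2x2 leaves. This shows why a hierarchical partition suits sparse sketch data: fine resolution is spent only where strokes are present.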
Abstract
Understanding the nature of human sketches is challenging because of the wide variation in how they are created. Recognizing complex structural patterns improves both the accuracy in recognizing sketches and the fidelity of the generated sketches. In this work, we introduce ViSketch-GPT, a novel algorithm designed to address these challenges through a multi-scale context extraction approach. The model captures intricate details at multiple scales and combines them using an ensemble-like mechanism, where the extracted features work collaboratively to enhance the recognition and generation of key details crucial for classification and generation tasks. The effectiveness of ViSketch-GPT is validated through extensive experiments on the QuickDraw dataset. Our model establishes a new benchmark, significantly outperforming existing methods in both classification and generation tasks, with substantial improvements in accuracy and the fidelity of generated sketches. The proposed algorithm offers a robust framework for understanding complex structures by extracting features that collaborate to recognize intricate details, enhancing the understanding of structures like sketches and making it a versatile tool for various applications in computer vision and machine learning.
