PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

Seunggwan Lee, Hwanhee Jung, Byoungsoo Koh, Qixing Huang, Sangho Yoon, Sangpil Kim

TL;DR

PASTA tackles the challenge of minimizing information loss in sketch-based 3D generation by fusing hand-drawn sketches with text-aligned priors from vision-language models. It uses a Text-Visual Transformer Decoder to fuse visual cues with text embeddings, and ISG-Net with IndivGCN and PartGCN to refine fine-grained and part-level structure. Shapes are represented with a SPAGHETTI-based Gaussian mixture model consisting of $N$ components, and parts are organized via hierarchical clustering into $K$ groups for guidance. The model is trained with a multi-term loss that aligns latent vectors with the SPAGHETTI space and enforces both local detail and global coherence. Experiments on chair, airplane, lamp, and real-image data show state-of-the-art performance and enable intuitive part-level editing with robust generalization.
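The part-grouping step mentioned above (organizing $N$ Gaussian components into $K$ groups via hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cluster_components`, the use of component centers as the only feature, and the centroid-linkage merge rule are all assumptions made for illustration, since the summary does not specify them.

```python
from math import dist


def cluster_components(centers, k):
    """Greedy agglomerative clustering of component centers into k groups.

    Hypothetical helper: centroid linkage on Gaussian means is an assumption,
    not the linkage or feature choice confirmed by the paper.
    """
    if not 1 <= k <= len(centers):
        raise ValueError("k must be between 1 and the number of components")

    # Start with every component in its own group.
    groups = [[i] for i in range(len(centers))]
    dims = len(centers[0])

    def centroid(group):
        return [sum(centers[i][d] for i in group) / len(group) for d in range(dims)]

    # Repeatedly merge the two groups whose centroids are closest.
    while len(groups) > k:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = dist(centroid(groups[a]), centroid(groups[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]

    return [sorted(g) for g in groups]


# Toy example: N = 6 component centers forming three well-separated pairs.
centers = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.1, 1.0, 0.0),
    (0.0, 0.0, 2.0), (0.1, 0.0, 2.0),
]
parts = cluster_components(centers, k=3)
print(parts)  # [[0, 1], [2, 3], [4, 5]]
```

The resulting index groups are the kind of part assignment that a part-level module could consume as guidance; in practice the clustering would likely use richer per-component features than the centers alone.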

Abstract

A fundamental challenge in conditional 3D shape generation is to minimize information loss and preserve the intent of the user's input. Existing approaches have predominantly focused on one of two conditional signals in isolation, i.e., user sketches or text descriptions, neither of which alone offers flexible control of the generated shape. In this paper, we introduce PASTA, a flexible approach that integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for visual cues that are missing from ambiguous sketches. In addition, we introduce ISG-Net, which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.
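The two-level idea behind ISG-Net (fine-grained processing per component, then aggregation into parts) can be sketched as follows. This is only a schematic under strong assumptions: real graph convolutional layers have learned weights and nonlinearities, which are omitted here, and the function names, the mean-aggregation rule, and the additive broadcast back to nodes are hypothetical choices, not details taken from the paper.

```python
def mean_aggregate(features, neighbors):
    """One parameter-free message-passing step: average each node with its neighbors."""
    out = []
    for i, feat in enumerate(features):
        idx = [i] + list(neighbors[i])
        out.append([sum(features[j][d] for j in idx) / len(idx) for d in range(len(feat))])
    return out


def pool_parts(features, parts):
    """Average node features within each part, giving one feature vector per part."""
    dims = len(features[0])
    return [[sum(features[i][d] for i in part) / len(part) for d in range(dims)]
            for part in parts]


def broadcast_parts(node_feats, part_feats, parts):
    """Add each part's feature back onto its member nodes (assumed residual-style fusion)."""
    out = [list(f) for f in node_feats]
    for p, part in enumerate(parts):
        for i in part:
            out[i] = [a + b for a, b in zip(out[i], part_feats[p])]
    return out


# Toy setup: 4 component nodes with 1-D features, grouped into 2 parts.
features = [[1.0], [3.0], [10.0], [20.0]]
node_neighbors = {0: [1], 1: [0], 2: [3], 3: [2]}
parts = [[0, 1], [2, 3]]
part_neighbors = {0: [1], 1: [0]}

# Fine-grained level (stand-in for IndivGCN): mix information between components.
node_feats = mean_aggregate(features, node_neighbors)
# Part level (stand-in for PartGCN): pool to parts, then mix information between parts.
part_feats = mean_aggregate(pool_parts(node_feats, parts), part_neighbors)
# Fuse part-level context back into every component.
refined = broadcast_parts(node_feats, part_feats, parts)
print(refined)  # [[10.5], [10.5], [23.5], [23.5]]
```

The point of the sketch is the information flow: local detail is refined first, and a coarser part-level pass then injects global structure back into each component.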

Paper Structure

This paper contains 29 sections, 14 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Sample images from the datasets used in our experiments, including CLIPasso [vinker2022clipasso], non-photo-realistic renderings [chan2022learning], AmateurSketch-3D [qi2021toward], and ProSketch-3D [zhong2020towards]. These datasets capture a range of sketching styles, from highly abstract representations to detailed, expert-drawn sketches.
  • Figure 2: Overview of PASTA. Our framework enhances sketch-based 3D shape generation by integrating visual embeddings and text-aligned priors. A visual backbone and vision-language model (VLM) extract meaningful features from an input sketch, which are then processed by a Text-Visual Transformer Decoder with learnable queries. To refine structural details, we introduce ISG-Net, which consists of IndivGCN for fine-grained feature processing and PartGCN for aggregating part-level information. The output features are fed into the SPAGHETTI shape decoder [hertz2022spaghetti], producing a more complete and structurally accurate 3D model.
  • Figure 2: Illustration of sketches processed using ControlNet [zhang2023adding]. The transformed sketches retain the structural essence of the originals while incorporating enhanced realism.
  • Figure 3: An example of the prompts we used for chair sketches and their various corresponding text descriptions from the VLM [liu2023visual]. The figure illustrates how the generated text descriptions emphasize key semantic details of the sketch.
  • Figure 3: Comparison of two architectures for the Text-Visual Transformer Decoder: (a) sequential cross-attention and (b) parallel cross-attention mechanisms.
  • ...and 12 more figures