ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model

Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen

TL;DR

ShapeGPT tackles the absence of a unified, instruction-driven model for 3D shapes by integrating 3D shapes, images, and text into a single multimodal language model. It discretizes shapes via a 3D VQ-VAE into tokens, maps them to shape words and shape sentences within a word-sentence-paragraph framework, and aligns with a pretrained language model through a three-stage training regime. Experiments on ShapeNet and Objaverse show competitive performance across text-to-shape, image-to-shape, shape-to-text, and multimodal-to-shape tasks, with ablations highlighting the importance of token length, model size, and pretraining. The approach enables versatile, instruction-based generation and editing of 3D shapes, with potential impact on design, manufacturing, and virtual environments, and points to future extensions to more modalities and dynamic scenes.

Abstract

The advent of large language models, enabling flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly ones that comprehensively handle 3D shapes together with other modalities, are still under-explored. By achieving instruction-based shape generation, versatile multimodal generative shape models can significantly benefit various fields like 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-included multi-modal framework that leverages strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework that discretizes continuous shapes into shape words, assembles these words into shape sentences, and integrates shapes with instructional text to form multi-modal paragraphs. To learn this shape-language model, we use a three-stage training scheme, including shape representation, multimodal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.
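
The word-sentence-paragraph idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D codebook, the latent features, the token names (`<shape_k>`, `<shape_start>`, `<shape_end>`), and the helper functions are all hypothetical, and nearest-neighbour lookup stands in for the quantization step of a trained 3D VQ-VAE.

```python
from typing import List, Sequence


def nearest_code(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def shape_to_sentence(features: Sequence[Sequence[float]],
                      codebook: Sequence[Sequence[float]]) -> List[str]:
    """Quantize each latent feature into a 'shape word' and wrap the words as a 'shape sentence'."""
    # Hypothetical token naming; a real model would add such tokens to the LM vocabulary.
    words = [f"<shape_{nearest_code(f, codebook)}>" for f in features]
    return ["<shape_start>"] + words + ["<shape_end>"]


def build_paragraph(instruction: str, sentence: List[str]) -> str:
    """Join instruction text and a shape sentence into one multimodal 'paragraph'."""
    return instruction + " " + " ".join(sentence)


# Toy data (assumed for illustration): a 4-entry codebook and 3 encoder features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]

sentence = shape_to_sentence(features, codebook)
paragraph = build_paragraph("Caption this shape:", sentence)
print(paragraph)
```

In this sketch each latent feature becomes one discrete token, so a shape reads like a short sequence of words that a language model can consume alongside ordinary instruction text.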


Paper Structure

This paper contains 10 sections, 1 equation, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Illustrative instances of ShapeGPT. We present ShapeGPT, a unified generative framework that handles various shape-centric multimodal tasks according to the provided instructions, switching between tasks without a perceptible transition. The blue shapes are our generated results.
  • Figure 2: Overview of the framework. ShapeGPT consists of two parts: Multimodal Corpus Construction, which tokenizes multimodal inputs for corpus collection (\ref{sec:method:corpus}), and the Shape-aware Multimodal Language Model, which comprehends vision-shape-language grammar (\ref{sec:method:lm}). Together they target diverse shape-relevant generation tasks, including image-to-shape, text-to-shape, shape-to-text, shape completion, and shape editing.
  • Figure 3: Training scheme. We introduce three training steps for ShapeGPT (\ref{sec:method:strategy}). First, we learn a shape codebook for discrete shape representation. Then, we align the vision, shape, and language models using a mixed multimodal corpus to capture the semantic coupling among these modalities. Finally, we fine-tune ShapeGPT with diverse instructions for shape-relevant tasks.
  • Figure 4: Qualitative comparison with state-of-the-art methods. We show generated shapes alongside ground-truth references for text-to-shape, image-to-shape, and multimodal-to-shape generation (the last combines several input modalities). We observe that our generated shapes align more accurately with the multimodal prompts.
  • Figure 5: More results on shape-relevant generation tasks, including shape captioning, editing, reasoning, and completion. The blue shapes and text are our generated results; the orange ones are the inputs.