ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen
TL;DR
ShapeGPT tackles the absence of a unified, instruction-driven model for 3D shapes by integrating 3D shapes, images, and text into a single multimodal language model. It discretizes shapes into tokens via a 3D VQ-VAE, maps the tokens to shape words and shape sentences within a word-sentence-paragraph framework, and aligns them with a pretrained language model through a three-stage training regime. Experiments on ShapeNet and Objaverse show competitive performance across text-to-shape, image-to-shape, shape-to-text, and multimodal-to-shape tasks, with ablations highlighting the importance of token length, model size, and pretraining. The approach enables versatile, instruction-based generation and editing of 3D shapes, with potential impact on design, manufacturing, and virtual environments, and points to future extensions to more modalities and dynamic scenes.
Abstract
The advent of large language models, enabling flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly ones that comprehensively handle 3D shapes together with other modalities, are still under-explored. By achieving instruction-based shape generation, versatile multimodal generative shape models can significantly benefit various fields like 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-inclusive multimodal framework that leverages strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework that discretizes continuous shapes into shape words, assembles these words into shape sentences, and integrates shapes with instructional text into multimodal paragraphs. To learn this shape-language model, we use a three-stage training scheme, comprising shape representation, multimodal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.
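
To make the word-sentence-paragraph idea concrete, the following is a minimal sketch of the general technique (nearest-neighbour vector quantization followed by token rendering), not the authors' implementation. The function names, the toy 2-D codebook, and the `<shape_k>`, `<shape_start>`, and `<shape_end>` token formats are all hypothetical and chosen only for illustration; the real model learns its codebook with a 3D VQ-VAE.

```python
# Illustrative sketch only: names, sizes, and token formats are assumptions,
# not ShapeGPT's actual code or vocabulary.

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]

def to_shape_sentence(indices):
    """Render codebook indices as 'shape words' and join them into a 'shape sentence'."""
    words = [f"<shape_{i}>" for i in indices]
    return "<shape_start>" + "".join(words) + "<shape_end>"

def to_paragraph(instruction, shape_sentence):
    """Combine instruction text and a shape sentence into one multimodal 'paragraph'."""
    return f"{instruction} {shape_sentence}"

# Toy codebook with 8 entries in 2-D (a learned codebook would be far larger).
codebook = [[float(k), float(8 - k)] for k in range(8)]

# Stand-in for encoder output: codebook entries with a small fixed offset.
source_ids = [3, 1, 7, 0, 3, 5]
features = [[codebook[i][0] + 0.1, codebook[i][1] - 0.2] for i in source_ids]

indices = quantize(features, codebook)
sentence = to_shape_sentence(indices)
paragraph = to_paragraph("Describe this shape:", sentence)
print(paragraph)
```

In this sketch, each index plays the role of a shape word, the delimited sequence is a shape sentence, and the instruction plus the sentence forms a paragraph that a language model could consume as ordinary tokens once the shape words are added to its vocabulary.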
