CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Longwen Zhang; Ziyu Wang; Qixuan Zhang; Qiwei Qiu; Anqi Pang; Haoran Jiang; Wei Yang; Lan Xu; Jingyi Yu

TL;DR

CLAY is a scalable, controllable 3D asset generator that unifies geometry and material synthesis under a pretrain-then-adapt paradigm. It combines a multi-resolution VAE for geometry with a latent diffusion transformer and a data-standardization pipeline to produce high-fidelity 3D surfaces and 2K-resolution PBR textures under multi-modal conditioning. The framework accepts text, images, and diverse 3D inputs (voxels, point clouds, bounding boxes, etc.), enabling fast creation of production-ready assets with strong geometric fidelity and material realism. Quantitative evaluations and user studies show CLAY’s advantages in geometry quality, appearance, and generation speed over state-of-the-art methods, underscoring its potential to democratize high-quality 3D content creation. The paper also discusses ethical considerations and limitations, with future work focusing on end-to-end integration and dynamic content generation to broaden applicability.

Abstract

In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and effort. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), which extracts rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra-large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D-native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K-resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. Even first-time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.

Paper Structure (40 sections, 5 equations, 15 figures, 5 tables)

Introduction
Related Work
Imposing 2D Images as Prior
Imposing 3D Geometry as Priors
Large-scale 3D Generative Model
Representation and Model Architecture
Multi-resolution VAE
Coarse-to-fine DiT
Scaling-up Scheme
Data Standardization for Pretraining
Geometry Unification
Geometry Annotation
Asset Enhancement
Mesh Quadrification and Atlasing
Material Synthesis
...and 25 more sections

Figures (15)

Figure 1: An overview of our CLAY framework for 3D generation. Central to the framework is a large generative model trained on extensive 3D data, capable of transforming textual descriptions into detailed 3D geometries. The model is further enhanced by physically-based material generation and versatile modal adaptation, to enable the creation of 3D assets from diverse concepts and ensure their realistic rendering in digital environments.
Figure 2: Network design of our VAE and DiT. With a minimalist design, our DiT supports scalable training and VAE operates effectively across various geometric resolutions.
Figure 3: Comparison against existing mesh preprocessing methods using cross-sectional analysis. The input is a non-watertight chair with its surface not closed. Red lines correspond to the faces of meshes, light gray indicates "outside" and dark gray indicates "inside". Our method maximizes positive volume while faithfully preserving geometric features. This robustness extends to non-watertight input meshes, ensuring consistent and reliable results.
Figure 4: Our Material Diffusion architecture and Asset Enhancement pipeline. Our Material Diffusion network, derived from existing diffusion models, facilitates efficient fine-tuning. Following mesh quadrification and atlasing, it generates textures through a multi-view approach and subsequently back-projects them onto UV maps. The resultant materials, closely aligned with geometries and user inputs (text/image), faithfully respond to diverse lighting conditions, culminating in realistic renderings.
Figure 5: Generation after LoRA fine-tuning on different specific datasets including the rock dataset and the pocket monster dataset. After generating a LEGO duck (center), which was one of the first toys designed by LEGO founder Ole Kirk Kristiansen, CLAY can further generate variants in stone styles (left) and pocket monster styles (right).
...and 10 more figures
