Meta 3D Gen
Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Mahendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Animesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, Andrea Vedaldi
TL;DR
Meta 3D Gen (3DGen) is a fast, two-stage pipeline for text-to-3D asset generation that produces high-quality 3D shapes and textures, with support for physically based rendering (PBR), in under a minute. It combines Meta 3D AssetGen (text-to-3D) and Meta 3D TextureGen (text-to-texture), representing objects in view, volumetric, and UV spaces and supporting efficient retexturing. Stage I generates the 3D geometry and an initial texture; Stage II refines the texture via diffusion-based texture generation and optional super-resolution. The combined pipeline achieves a 68% win rate over the single-stage model and outperforms industry baselines in prompt fidelity and visual quality on complex prompts, while being significantly faster. It can also retexture previously generated or artist-created meshes from new text prompts, with applications in games, AR/VR, and Metaverse content creation.
Abstract
We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user. 3DGen integrates key technical components, Meta 3D AssetGen and Meta 3D TextureGen, that we developed for text-to-3D and text-to-texture generation, respectively. By combining their strengths, 3DGen represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space. The integration of these two techniques achieves a win rate of 68% with respect to the single-stage model. We compare 3DGen to numerous industry baselines, and show that it outperforms them in terms of prompt fidelity and visual quality for complex textual prompts, while being significantly faster.
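To make the 68% figure concrete, here is a minimal sketch of how a win rate can be computed from pairwise preference judgments. The function name `win_rate`, the vote labels, and the vote counts are all hypothetical and chosen only for illustration; they are not the paper's evaluation data, and the sketch assumes every comparison yields a strict preference (no ties).

```python
def win_rate(preferences, method="3DGen"):
    """Return the fraction of pairwise comparisons in which `method` was preferred.

    `preferences` is a list of labels, one per comparison, naming the preferred
    system. Assumption: no ties, so each entry names exactly one winner.
    """
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(1 for choice in preferences if choice == method)
    return wins / len(preferences)


# Hypothetical votes (not the paper's data): 68 of 100 comparisons favor 3DGen.
votes = ["3DGen"] * 68 + ["single-stage"] * 32
print(f"{win_rate(votes):.0%}")  # prints "68%"
```

A win rate above 50% means the method was preferred more often than the alternative; how far above 50% counts as meaningful depends on the number of comparisons collected.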
