CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
Hwan Heo, Jangyeong Kim, Seongyeong Lee, Jeong A Wi, Junyoung Choi, Sangjun Ahn
TL;DR
CaPa tackles the challenge of fast, high-fidelity 3D asset generation by decoupling geometry and texture synthesis into a two-stage carve-and-paint pipeline. It uses a multi-view guided 3D latent diffusion model to produce a mesh-friendly occupancy field. A texture synthesis stage then applies Spatially Decoupled Cross Attention within a model-agnostic diffusion framework to generate 4K textures, and a 3D-aware occlusion inpainting module fills unseen regions. The approach delivers state-of-the-art texture fidelity and geometric stability while achieving end-to-end generation in under 30 seconds, addressing the Janus problem and occlusion-related seams without additional training. CaPa’s design is highly scalable, integrating with pre-trained diffusion models (e.g., SDXL) and tools like ControlNet and LoRA, enabling practical, commercial-grade 3D asset production.
Abstract
The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite their proliferation, 3D generation algorithms frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed a subset of these issues, a comprehensive solution remains elusive. In this paper, we introduce CaPa, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. First, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Then, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, yielding cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.
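To make the notion of "untextured regions" concrete, the following is a minimal sketch of how such regions could be located before any inpainting runs. It is not the paper's occlusion inpainting algorithm, which the abstract does not detail. It assumes, purely for illustration, that each rendered view supplies a boolean mask over the UV texture marking which texels it covers; the names `untextured_texels` and `coverage` are hypothetical.

```python
from typing import List

# Assumption: one boolean per texel of the UV texture, True where a view covers it.
Mask = List[List[bool]]


def untextured_texels(view_masks: List[Mask]) -> Mask:
    """Return a mask that is True for texels not covered by any view."""
    height = len(view_masks[0])
    width = len(view_masks[0][0])
    return [
        [not any(mask[y][x] for mask in view_masks) for x in range(width)]
        for y in range(height)
    ]


def coverage(view_masks: List[Mask]) -> float:
    """Fraction of texels covered by at least one view."""
    holes = untextured_texels(view_masks)
    total = sum(len(row) for row in holes)
    missing = sum(1 for row in holes for cell in row if cell)
    return 1.0 - missing / total


# Toy 2x2 texture seen from two hypothetical views.
front = [[True, True], [False, False]]
side = [[False, True], [False, True]]
holes = untextured_texels([front, side])
```

In this toy case only the bottom-left texel is left for an inpainting step to fill. A real pipeline would derive the masks from rasterizing the mesh, and would also need to exclude texels that lie outside any UV chart.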
