CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation

Hwan Heo; Jangyeong Kim; Seongyeong Lee; Jeong A Wi; Junyoung Choi; Sangjun Ahn

TL;DR

CaPa tackles the challenge of fast, high-fidelity 3D asset generation by decoupling geometry and texture synthesis into a two-stage carve-and-paint pipeline. It employs a multi-view guided 3D latent diffusion model to produce a mesh-friendly occupancy field, followed by a texture synthesis stage that uses Spatially Decoupled Cross Attention to generate 4K textures in a model-agnostic diffusion framework, plus a 3D-aware occlusion inpainting module to fill unseen regions. The approach delivers state-of-the-art texture fidelity and geometric stability while achieving end-to-end generation in under 30 seconds, addressing the Janus problem and occlusion-related seams without additional training. CaPa’s design is highly scalable, integrating with pre-trained diffusion models (e.g., SDXL) and tools like ControlNet and LoRA, enabling practical, commercial-grade 3D asset production.

Abstract

The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While prior studies have addressed some of these issues individually, a comprehensive solution remains elusive. In this paper, we introduce CaPa, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, yielding cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.

Paper Structure (34 sections, 7 equations, 15 figures, 3 tables)

Introduction
Related Work
3D Asset Generation
3D-Native Reconstruction Models
Texture Generation
Methodology
Geometry Generation via 3D Latent Diffusion
Latent Space for Geometry Representation
Multi-View Guided 3D Latent Diffusion
Texture Generation for Input Geometry
Spatially Decoupled Cross Attention
Occlusion Inpainting
Experiments
Implementation Details
Qualitative Comparison
...and 19 more sections

Figures (15)

Figure 1: Comparison of mesh quality with state-of-the-art image-to-3D methods. CaPa can generate a hyper-quality textured mesh in under 30 seconds, providing 3D assets ready for commercial applications such as games, movies, and VR/AR.
Figure 2: CaPa pipeline. We first generate 3D geometry using a 3D latent diffusion model. Using the learned 3D latent space with ShapeVAE, we train a 3D Latent Diffusion Model that generates 3D geometries, guided by multi-view images to ensure alignment between the generated shape and texture. After the 3D geometry is created, we render four orthogonal views of the mesh, which serve as inputs for texture generation. To produce a high-quality texture while preventing the Janus problem, we utilize a novel, model-agnostic spatially decoupled attention. Finally, we obtain a hyper-quality textured mesh through back projection and a 3D-aware occlusion inpainting algorithm.
Figure 3: Spatially Decoupled Cross Attention. To produce high-quality multi-view images for a given geometry, we design a model-agnostic Spatially Decoupled Cross Attention. During cross-attention in the denoising U-Net, we replicate hidden feature channels so that each duplicated channel focuses solely on its designated view. Since the design is model-agnostic, we can utilize an external ControlNet to guide the textures so that they align with the input mesh.
Figure 4: 3D-Aware Occlusion Inpainting. First, we cluster the normals and spatial coordinates of the occluded faces. Using the cluster centers as viewpoints, we create specialized UV maps through projection mapping. This approach captures surface locality, allowing 2D diffusion-based inpainting to effectively fill occluded regions. Note that this UV map is utilized solely for occlusion inpainting.
Figure 5: Comparison of Texturing Methods. Unlike prior works, CaPa effectively resolves the Janus problem while keeping a consistent identity.
...and 10 more figures
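
The clustering step described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain k-means with a deterministic farthest-point initialization, a hypothetical `normal_weight` that balances position against orientation, and a hypothetical rule that places each camera at a fixed `distance` along the mean normal of its cluster. The function names `cluster_occluded_faces` and `viewpoints_from_clusters` are illustrative only.

```python
import numpy as np


def cluster_occluded_faces(centroids, normals, k, normal_weight=1.0, iters=20):
    """Group occluded faces by position and orientation (assumed: plain k-means)."""
    # Unit-length normals so orientation is comparable across faces.
    norm = np.linalg.norm(normals, axis=1, keepdims=True)
    unit = normals / np.maximum(norm, 1e-8)
    # One 6-D feature per face: [x, y, z, w*nx, w*ny, w*nz].
    features = np.hstack([centroids, normal_weight * unit]).astype(float)

    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [features[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers)

    def assign(c):
        # Index of the nearest center for every face.
        diff = features[:, None, :] - c[None, :, :]
        return np.linalg.norm(diff, axis=2).argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign(centers), centers


def viewpoints_from_clusters(centers, distance=2.0):
    """Turn cluster centers into camera positions and viewing directions."""
    positions = centers[:, :3]
    directions = centers[:, 3:]
    length = np.linalg.norm(directions, axis=1, keepdims=True)
    directions = directions / np.maximum(length, 1e-8)
    eye = positions + distance * directions  # step back along the mean normal
    look = -directions                       # look back at the surface
    return eye, look


# Toy input: two small occluded patches facing opposite directions.
face_centroids = np.array([[0, 0, 1], [0.1, 0, 1], [0, 0.1, 1],
                           [0, 0, -1], [0.1, 0, -1], [0, 0.1, -1]], dtype=float)
face_normals = np.array([[0, 0, 1]] * 3 + [[0, 0, -1]] * 3, dtype=float)
face_labels, cluster_centers = cluster_occluded_faces(face_centroids, face_normals, k=2)
camera_eye, camera_look = viewpoints_from_clusters(cluster_centers)
```

In this sketch each cluster yields one camera, and projecting the cluster's faces from that camera would give the kind of locality-preserving UV layout the caption describes. The projection and the 2D inpainting themselves are outside the scope of the example.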
