Text-guided Controllable Mesh Refinement for Interactive 3D Modeling

Yun-Chun Chen, Selena Ling, Zhiqin Chen, Vladimir G. Kim, Matheus Gadelha, Alec Jacobson

TL;DR

A novel technique that adds geometric details to an input coarse 3D mesh under the guidance of a text prompt, while giving the user explicit control over the coarse structure, pose, and desired details of the resulting 3D mesh.

Abstract

We propose a novel technique for adding geometric details to an input coarse 3D mesh guided by a text prompt. Our method is composed of three stages. First, we generate a single-view RGB image conditioned on the input coarse geometry and the input text prompt. This single-view image generation step allows the user to pre-visualize the result and offers stronger conditioning for subsequent multi-view generation. Second, we use our novel multi-view normal generation architecture to jointly generate six different views of the normal images. The joint view generation reduces inconsistencies and leads to sharper details. Third, we optimize our mesh with respect to all views and generate a fine, detailed geometry as output. The resulting method produces an output within seconds and offers explicit user control over the coarse structure, pose, and desired details of the resulting 3D mesh.
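
The three stages above form a simple dataflow, which the following Python sketch makes explicit. It is only an illustration under stated assumptions: `RefinementPipeline`, `generate_rgb`, `generate_normals`, and `optimize_mesh` are hypothetical names for user-supplied callables, not the authors' API, and the actual diffusion models and mesh optimizer are not implemented here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Hypothetical placeholder types; a real system would use mesh and image classes.
Mesh = Any
Image = Any


@dataclass
class RefinementPipeline:
    """Wires three user-supplied stages together (all names are illustrative)."""

    generate_rgb: Callable[[Mesh, str], Image]              # stage 1
    generate_normals: Callable[[Image, Mesh], List[Image]]  # stage 2
    optimize_mesh: Callable[[Mesh, List[Image]], Mesh]      # stage 3
    num_views: int = 6  # the abstract states six jointly generated views

    def run(self, coarse_mesh: Mesh, prompt: str) -> Mesh:
        # Stage 1: single-view RGB image conditioned on coarse geometry and text.
        rgb = self.generate_rgb(coarse_mesh, prompt)
        # Stage 2: multi-view normal images, conditioned on the stage-1 image.
        normals = self.generate_normals(rgb, coarse_mesh)
        if len(normals) != self.num_views:
            raise ValueError(f"expected {self.num_views} views, got {len(normals)}")
        # Stage 3: refine the coarse mesh against all generated views.
        return self.optimize_mesh(coarse_mesh, normals)
```

Because stage 1 is a separate callable, an interactive tool could show its RGB output to the user as a preview before committing to the later stages, which matches the pre-visualization role the abstract describes.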

Paper Structure (25 sections, 6 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Insight. Using a depth/normal image as a control mechanism in large text-to-image models usually leads to images that contain more details than are present in the original shape. Those details are so prominent that they can even be captured by off-the-shelf shape estimation models.
  • Figure 2: Method overview. Our method consists of three stages: single-view generation, multi-view generation and mesh refinement/optimization. Given an input mesh and an input text prompt, we first use a large-scale pre-trained diffusion model (highlighted in red) to generate an RGB image that respects the input conditions. Next, we use a multi-view diffusion model (highlighted in blue) that takes as input the generated RGB image and the normal renderings of the input mesh and generates multi-view normals. Finally, we use the generated multi-view normals to supervise the refinement of the input mesh.
  • Figure 3: Influence of guidance strength. The user controls how much detail the system generates by setting the number of backward diffusion steps during which the guidance is applied. This control is available during both the single-view (top) and multi-view (bottom) stages.
  • Figure 4: Qualitative results. Our method generates 3D meshes that have better geometric details and visual quality compared to state-of-the-art methods.
  • Figure 5: Multi-view control. Our method generates details using the full initial shape as guidance. Notice how the back legs of the cat and its tail follow the input coarse mesh (in green). In contrast, Wonder3D yields reasonable renditions when viewed from the initial viewpoint $\theta_s$ (top row) but clearly fails to follow the coarse geometric guidance when seen from other views (bottom row).
  • ...and 6 more figures
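
The guidance-strength control described in Figure 3 can be pictured as a per-step on/off schedule over the backward diffusion process. The sketch below is illustrative only. It assumes guidance is applied during the first `guided_steps` backward steps and switched off afterward (the caption does not say which steps are guided), and `guidance_schedule`, `run_backward_process`, `step_fn`, and `control` are hypothetical names rather than the authors' code.

```python
from typing import Any, Callable, List, Optional


def guidance_schedule(num_steps: int, guided_steps: int) -> List[bool]:
    """Return one flag per backward step: True where guidance is applied.

    Assumption: the guided steps are the first `guided_steps` of the process.
    """
    if not 0 <= guided_steps <= num_steps:
        raise ValueError("guided_steps must lie in [0, num_steps]")
    return [i < guided_steps for i in range(num_steps)]


def run_backward_process(
    x: Any,
    num_steps: int,
    guided_steps: int,
    step_fn: Callable[[Any, Optional[Any]], Any],
    control: Any,
) -> Any:
    """Run a generic backward loop, passing `control` only on guided steps.

    `step_fn(x, control_or_none)` is a hypothetical single denoising step.
    """
    for use_control in guidance_schedule(num_steps, guided_steps):
        x = step_fn(x, control if use_control else None)
    return x
```

Under this assumption, a larger `guided_steps` keeps the conditioning active for more of the process, while a smaller value leaves more steps unguided.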