ConTEXTure: Consistent Multiview Images to Texture

Jaehoon Ahn, Sumin Cho, Harim Jung, Kibeom Hong, Seonghoon Ban, Moon-Ryul Jung

TL;DR

ConTEXTure presents a depth-guided diffusion framework that resolves viewpoint bias in multi-view texture atlas generation by combining front-view conditioning with simultaneous multi-view inverse rendering. Leveraging SD2-depth for the front view and Zero123++ with Depth ControlNet to synthesize six consistent viewpoints, the method learns a single texture atlas via a two-stage process that incorporates a meta-texture and view-weights to blend contributions across views. The approach yields substantially improved viewpoint consistency and faster runtimes compared to prior work, as evidenced by quantitative metrics (e.g., a dramatic reduction in FID) and user studies, while highlighting challenges in lighting and residual color artifacts. These results advance practical texture synthesis for 3D meshes by enabling coherent, view-consistent textures across front, back, and oblique viewpoints, with implications for real-time rendering and interactive design.

Abstract

We introduce ConTEXTure, a generative network designed to create a texture map/atlas for a given 3D mesh using images from multiple viewpoints. The process begins with generating a front-view image from a text prompt, such as 'Napoleon, front view', describing the 3D mesh. Additional images from different viewpoints are derived from this front-view image and camera poses relative to it. ConTEXTure builds upon the TEXTure network, which uses text prompts for six viewpoints (e.g., 'Napoleon, front view', 'Napoleon, left view', etc.). However, TEXTure often generates images for non-front viewpoints that do not accurately represent those viewpoints. To address this issue, we employ Zero123++, which generates multiple view-consistent images for the six specified viewpoints simultaneously, conditioned on the initial front-view image and the depth maps of the mesh for the six viewpoints. By utilizing these view-consistent images, ConTEXTure learns the texture atlas from all viewpoint images concurrently, unlike previous methods that do so sequentially. This approach ensures that the rendered images from various viewpoints, including back, side, bottom, and top, are free from viewpoint irregularities.
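
To make the idea of combining all viewpoints concurrently more concrete, the following is a minimal sketch of per-texel view-weighted blending. It is an illustration under stated assumptions, not the paper's learned inverse-rendering procedure: the function name `blend_texel`, the RGB tuples, and the numeric weights are all hypothetical, and the weights stand in for the view-weights mentioned in the TL;DR.

```python
def blend_texel(colors, weights, eps=1e-8):
    """Weighted average of one texel's RGB color as observed from several views.

    colors  : list of (r, g, b) tuples, one per view (hypothetical inputs)
    weights : list of non-negative view-weights, one per view (hypothetical)
    Returns None when no view contributes to this texel.
    """
    total = sum(weights)
    if total < eps:
        return None  # texel is not visible from any view
    return tuple(
        sum(w * c[k] for w, c in zip(weights, colors)) / total
        for k in range(3)
    )


# Toy example: the front view sees red, the left view sees blue.
front = (1.0, 0.0, 0.0)
left = (0.0, 0.0, 1.0)
texel = blend_texel([front, left], [0.75, 0.25])
print(texel)
```

Because every view contributes in a single pass, no view overwrites another; a sequential scheme would instead let later views paint over earlier ones.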

Paper Structure

This paper contains 18 sections, 2 equations, 11 figures, 3 tables, 2 algorithms.

Figures (11)

  • Figure 1: Four examples of texture maps generated by the ConTEXTure model, each applied to its mesh. Each model is shown from three equally spaced azimuth angles (0°, 120°, and -120°), measured from the front view. The prompt used to generate each texture is written directly beneath the images.
  • Figure 3: A visual representation of the ConTEXTure model's generation process. Our custom implementation of Zero123++ performs a latent image blending process, shown in more detail in Figure 4. Refer to Algorithm \ref{alg:contexture} for more information on the overall texture generation process.
  • Figure 4: Our use of Zero123++ features a custom diffusion process inspired by the implementation of TEXTure (richardson2023texture). After each denoising step, we blend the latent image $z_{grid,t-1}$ with the noisy ground-truth latent $z_{Q_{grid,t-1}}$. Blending according to the mask $m_{grid}$ ensures that the front-view image, which has already been projected back onto the texture atlas, is not regenerated by the Zero123++ model.
  • Figure 5: The novel-view images generated by Zero123++ are slightly misaligned with one another, which results in a low-quality texture when they are projected onto the mesh without further postprocessing. The blending technique, used in TEXTure (richardson2023texture) to prevent already-generated regions of the image from being overwritten, remains effective in ConTEXTure. The prompt "A photo of Spiderman, front view" was used for both textures.
  • Figure 7: When the front- and back-side depth maps of the person mesh are used with the prompt "white humanoid robot, movie poster, villain character of a science fiction movie," the viewpoint bias issue manifests as eyes generated on both the front and back sides of the head.
  • ...and 6 more figures
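
The latent blending step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `add_noise` and `blend_latents`, the mask convention (1 = free to generate, 0 = keep known content), the DDPM-style noising formula, and the toy 2x2 "latents" are all assumptions made for this example.

```python
import random


def add_noise(latent, alpha_bar, rng):
    """Noise a clean latent to cumulative signal level alpha_bar (DDPM forward form)."""
    signal = alpha_bar ** 0.5
    sigma = (1.0 - alpha_bar) ** 0.5
    return [[signal * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in latent]


def blend_latents(z_pred, z_known_noisy, mask):
    """Per element: keep z_pred where mask is 1, z_known_noisy where mask is 0."""
    return [
        [m * zp + (1.0 - m) * zk for zp, zk, m in zip(row_p, row_k, row_m)]
        for row_p, row_k, row_m in zip(z_pred, z_known_noisy, mask)
    ]


rng = random.Random(0)
z_pred = [[5.0, 5.0], [5.0, 5.0]]    # stand-in for the denoiser output at step t-1
z_known = [[1.0, 1.0], [1.0, 1.0]]   # stand-in for the latent of already-textured content
mask = [[1.0, 0.0], [0.0, 1.0]]      # assumed convention: 1 = generate, 0 = keep

# alpha_bar = 1.0 means "no noise", which keeps this toy run deterministic.
z_known_noisy = add_noise(z_known, alpha_bar=1.0, rng=rng)
blended = blend_latents(z_pred, z_known_noisy, mask)
print(blended)
```

In a real sampler the known latent would be re-noised to the noise level of the current step before blending, so that both operands share the same noise statistics; regions where the mask is 0 then track the known content throughout denoising instead of being regenerated.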