ConTEXTure: Consistent Multiview Images to Texture
Jaehoon Ahn, Sumin Cho, Harim Jung, Kibeom Hong, Seonghoon Ban, Moon-Ryul Jung
TL;DR
ConTEXTure presents a depth-guided diffusion framework that resolves viewpoint bias in multi-view texture atlas generation by combining front-view conditioning with simultaneous multi-view inverse rendering. Leveraging SD2-depth for the front view and Zero123++ with Depth ControlNet to synthesize six consistent viewpoints, the method learns a single texture atlas via a two-stage process that incorporates a meta-texture and view-weights to blend contributions across views. The approach yields substantially improved viewpoint consistency and faster runtimes compared to prior work, as evidenced by quantitative metrics (e.g., a dramatic reduction in FID) and user studies, while highlighting challenges in lighting and residual color artifacts. These results advance practical texture synthesis for 3D meshes by enabling coherent, view-consistent textures across front, back, and oblique viewpoints, with implications for real-time rendering and interactive design.
Abstract
We introduce ConTEXTure, a generative network designed to create a texture map/atlas for a given 3D mesh using images from multiple viewpoints. The process begins by generating a front-view image from a text prompt describing the 3D mesh, such as 'Napoleon, front view'. Images from the other viewpoints are then derived from this front-view image and the camera poses relative to it. ConTEXTure builds upon the TEXTure network, which uses a separate text prompt for each of six viewpoints (e.g., 'Napoleon, front view', 'Napoleon, left view', etc.). However, for non-front viewpoints TEXTure often generates images that do not accurately represent the requested viewpoint. To address this issue, we employ Zero123++, which generates view-consistent images for the six specified viewpoints simultaneously, conditioned on the initial front-view image and the depth maps of the mesh for those six viewpoints. Using these view-consistent images, ConTEXTure learns the texture atlas from all viewpoint images concurrently, unlike previous methods that process the viewpoints sequentially. This approach ensures that the rendered images from various viewpoints, including back, side, bottom, and top, are free from viewpoint irregularities.
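To make the idea of combining all views at once more concrete, the following is a minimal sketch of blending per-view texel colors with per-view weights. It is an illustration under stated assumptions, not the authors' procedure: the paper learns the atlas through inverse rendering, whereas this sketch uses a plain normalized weighted average. The function name `blend_views`, the array layout, and the choice of weights are all hypothetical.

```python
import numpy as np

def blend_views(view_colors, view_weights, eps=1e-8):
    """Blend per-view texel colors into one atlas (hypothetical helper).

    view_colors:  (K, H, W, 3) color that each of K views assigns to each texel
    view_weights: (K, H, W) non-negative weights; 0 where a view does not
                  see a texel (assumed, e.g., to grow as the surface faces
                  the camera more directly)
    Returns the blended (H, W, 3) atlas and an (H, W) coverage mask.
    """
    colors = np.asarray(view_colors, dtype=np.float64)
    weights = np.asarray(view_weights, dtype=np.float64)

    # Total weight each texel receives across all views.
    total = weights.sum(axis=0)

    # Weighted sum of colors, normalized by the total weight.
    # Texels seen by no view keep a zero color instead of dividing by zero.
    weighted = (weights[..., None] * colors).sum(axis=0)
    atlas = weighted / np.maximum(total, eps)[..., None]

    covered = total > eps
    return atlas, covered


# Two views of a 1x2 atlas: view 0 paints red, view 1 paints blue.
colors = np.zeros((2, 1, 2, 3))
colors[0, ..., 0] = 1.0   # red channel for view 0
colors[1, ..., 2] = 1.0   # blue channel for view 1
weights = np.array([[[3.0, 0.0]],    # view 0 sees only the first texel
                    [[1.0, 0.0]]])   # view 1 sees only the first texel
atlas, covered = blend_views(colors, weights)
```

In this toy case every view contributes to a texel in proportion to its weight, so no single viewpoint is processed first and allowed to dominate, which is the intuition behind learning from all views concurrently rather than sequentially.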
