An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes

Zhengyi Zhao, Chen Song, Xiaodong Gu, Yuan Dong, Qi Zuo, Weihao Yuan, Liefeng Bo, Zilong Dong, Qixing Huang

TL;DR

The paper addresses multi-view consistency when texturing a given 3D mesh from a text prompt with pre-trained text-to-image models. It proposes a four-stage optimization framework. Stage I generates an over-complete set of RGB-D views with an MV-consistent diffusion process. Stage II selects a mutually consistent subset of views that still covers the whole mesh, using a sequential SDP relaxation. Stage III performs joint non-rigid alignment: color adjustment on a sparse FFD lattice, followed by dense warping with SIFTFlow. Stage IV stitches the texture by solving a second-order MRF that assigns each mesh face to a view, and Stages III and IV are iterated so that alignment concentrates near the stitching cuts. The approach reports qualitative and quantitative gains over state-of-the-art methods (e.g., improved photorealism and lower FID) on Objaverse models, supported by user studies and ablations. Stated limitations are incomplete modeling of illumination factors, partial decoupling between pairwise and joint alignment, and higher computational cost; the suggested future directions are end-to-end integration and latent-parameter conditioning.

Abstract

A fundamental problem in the texturing of 3D meshes using pre-trained text-to-image models is to ensure multi-view consistency. State-of-the-art approaches typically use diffusion models to aggregate multi-view inputs, where common issues are the blurriness caused by the averaging operation in the aggregation step or inconsistencies in local features. This paper introduces an optimization framework that proceeds in four stages to achieve multi-view consistency. Specifically, the first stage generates an over-complete set of 2D textures from a predefined set of viewpoints using an MV-consistent diffusion process. The second stage selects a subset of views that are mutually consistent while covering the underlying 3D model. We show how to achieve this goal by solving semi-definite programs. The third stage performs non-rigid alignment to align the selected views across overlapping regions. The fourth stage solves an MRF problem to associate each mesh face with a selected view. In particular, the third and fourth stages are iterated, with the cuts obtained in the fourth stage encouraging non-rigid alignment in the third stage to focus on regions close to the cuts. Experimental results show that our approach significantly outperforms baseline approaches both qualitatively and quantitatively. Project page: https://aigc3d.github.io/ConsistenTex.
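
To make the fourth stage concrete, the following is a minimal sketch of assigning mesh faces to views by minimizing an MRF-style energy. It is not the paper's solver: it assumes a simplified energy with a per-face data cost plus a Potts penalty between adjacent faces, and it uses iterated conditional modes (ICM) as a stand-in optimizer. The names `assign_faces_icm`, `seam_edges`, `unary`, and `smooth_weight` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def assign_faces_icm(
    unary: List[List[float]],
    edges: List[Tuple[int, int]],
    smooth_weight: float = 1.0,
    max_sweeps: int = 20,
) -> List[int]:
    """Label each face with one view using iterated conditional modes.

    unary[f][v] is an assumed cost of texturing face f from view v.
    Adjacent faces pay smooth_weight when their labels differ (Potts term).
    ICM only finds a local minimum of this simplified energy.
    """
    num_faces = len(unary)
    num_views = len(unary[0])

    # Build the face adjacency lists from the undirected edge list.
    neighbors: List[List[int]] = [[] for _ in range(num_faces)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Start from the label with the lowest data cost for every face.
    labels = [
        min(range(num_views), key=lambda v: unary[f][v])
        for f in range(num_faces)
    ]

    for _ in range(max_sweeps):
        changed = False
        for f in range(num_faces):
            def local_cost(v: int) -> float:
                disagreements = sum(1 for g in neighbors[f] if labels[g] != v)
                return unary[f][v] + smooth_weight * disagreements

            best = min(range(num_views), key=local_cost)
            # Only switch when the energy strictly decreases.
            if local_cost(best) < local_cost(labels[f]):
                labels[f] = best
                changed = True
        if not changed:
            break
    return labels


def seam_edges(labels: List[int], edges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the adjacent face pairs whose labels differ (the stitching cuts)."""
    return [(a, b) for a, b in edges if labels[a] != labels[b]]


if __name__ == "__main__":
    # Toy strip of six faces and two views; face 1 weakly prefers view 1.
    costs = [[0, 2], [0.5, 0.4], [0, 2], [2, 0], [2, 0], [2, 0]]
    strip = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
    result = assign_faces_icm(costs, strip, smooth_weight=1.0)
    print(result, seam_edges(result, strip))
```

On the toy strip, the smoothness term removes the isolated label on face 1 and leaves a single cut between faces 2 and 3. In an iterated scheme such as the one the abstract describes, edges like these would mark where the next alignment pass should concentrate.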

Paper Structure

This paper contains 16 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Given a 3D mesh and a text prompt, we propose an optimization framework to synthesize the multi-view consistent texture. Top: meshes from a 3D generation model. Bottom: artist-created meshes.
  • Figure 2: The overall pipeline of our approach, which consists of four stages. The first stage (view generation) uses a pre-trained text-to-image model to generate an over-complete set of RGB-D images of the input model. The second stage (view selection) selects a subset of consistent RGB-D images from the view generation output. The third stage (view alignment) performs non-rigid warping to further improve the multi-view consistency among the selected images. The last stage (texture stitching) produces cuts between pairs of overlapping images by solving a second-order MRF problem.
  • Figure 3: Three types of multi-view inconsistencies from pre-trained text-to-image models. (a) The appearance of the image content may change significantly, even with slight perturbations in the camera pose (see the bottom row). (b) There are variations in illumination among overlapping regions of different views. (c) There are drifts in the detail of the image between overlapping regions of different views. (The left column images in (b) and (c) are warped with each other.)
  • Figure 4: Illustration of view selection. (Left) The input images ranked in increasing order of $S(I_i)$. (Right) The images selected by considering the image scores $S(I_i)$, the consistency scores $S(I_i, I_j)$, and the covering constraint.
  • Figure 5: Illustration of the view alignment stage. We show the result of two overlapping images through the joint view alignment procedure. (a) Two input images. (b) After adjustment of the global illumination. (c) From left to right: overlapping regions of each input image, warped images using ground-truth maps between the overlapping regions, overlaid results of the first two columns, and overlaid after SIFTFlow alignment. (d) Overlaid alignments after joint alignment. The pairwise SIFTFlow alignments are preserved.
  • ...and 4 more figures
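
The view selection illustrated in Figure 4 trades off per-image scores $S(I_i)$, pairwise consistency scores $S(I_i, I_j)$, and a covering constraint. The sketch below shows how these three ingredients interact, under loud assumptions: it replaces the paper's semi-definite programs with a plain greedy heuristic, it assumes higher scores are better, and it models the covering constraint as "every face is visible in at least one selected view". The names `select_views_greedy`, `image_score`, `pair_score`, and `visible_faces` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def select_views_greedy(
    image_score: List[float],
    pair_score: Dict[Tuple[int, int], float],
    visible_faces: List[Set[int]],
    num_faces: int,
) -> List[int]:
    """Greedily pick views until every mesh face is covered.

    image_score[i] plays the role of S(I_i); pair_score[(i, j)] with i < j
    plays the role of S(I_i, I_j). Missing pairs count as 0. This is a
    greedy stand-in, not the SDP-based selection described in the paper.
    """
    uncovered = set(range(num_faces))
    selected: List[int] = []

    def pair(i: int, j: int) -> float:
        return pair_score.get((min(i, j), max(i, j)), 0.0)

    while uncovered:
        best_view, best_gain = None, float("-inf")
        for i in range(len(image_score)):
            # Skip views already chosen or views that cover nothing new.
            if i in selected or not (visible_faces[i] & uncovered):
                continue
            # Own quality plus consistency with the views chosen so far.
            gain = image_score[i] + sum(pair(i, j) for j in selected)
            if gain > best_gain:
                best_view, best_gain = i, gain
        if best_view is None:
            raise ValueError("candidate views cannot cover the mesh")
        selected.append(best_view)
        uncovered -= visible_faces[best_view]
    return selected


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7]
    pairs = {(0, 1): -1.0, (0, 2): 0.5, (1, 2): 0.0}
    visibility = [{0, 1}, {1, 2, 3}, {2, 3}]
    print(select_views_greedy(scores, pairs, visibility, num_faces=4))
```

In the toy input, view 1 has a higher individual score than view 2 but is inconsistent with view 0, so the heuristic selects views 0 and 2. A greedy pass like this gives no optimality guarantee, which is one reason a joint formulation such as a relaxation over all views at once can be preferable.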