Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman

TL;DR

This work tackles multi-view inconsistency and NeRF reconstruction artifacts in text-to-3D diffusion models, which arise from the limited 3D data available for supervised finetuning (SFT). It introduces Carve3D, an improved reinforcement learning finetuning (RLFT) framework guided by a novel Multi-view Reconstruction Consistency (MRC) metric that compares diffusion outputs to NeRF renderings from the same viewpoints. Using purely on-policy RLFT with KL regularization and scaling laws, the resulting model, Carve3DM, achieves better multi-view consistency and NeRF reconstruction quality while preserving prompt alignment and texture details, outperforming both longer SFT and existing models. The results suggest that combining SFT with Carve3D's RLFT is essential for robust multi-view diffusion models and offers a scalable path toward more reliable 3D generation from text.

Abstract

Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from their own generated data and improve beyond the limitations of their SFT datasets. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote as Carve3DM, demonstrates better multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline. Our code, training and testing data, and video results are available at: https://desaixie.github.io/carve-3d.
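
To make the MRC comparison concrete, here is a minimal sketch of the idea: each generated view is paired with the NeRF rendering at the same camera viewpoint, and the per-view dissimilarities are averaged. The function names `mrc` and `mse_dissimilarity` are hypothetical, the NeRF reconstruction and rendering step is assumed to have happened already, and mean squared error is only a stand-in for whatever image dissimilarity metric the paper actually uses.

```python
import numpy as np


def mse_dissimilarity(a, b):
    """Stand-in image dissimilarity (mean squared error).

    Assumption for illustration only; not the paper's metric.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def mrc(generated_views, rendered_views, dissimilarity=mse_dissimilarity):
    """Average dissimilarity between generated views and NeRF renderings.

    generated_views[i] and rendered_views[i] must share a camera viewpoint.
    Lower values mean the generated views agree better with the reconstruction.
    """
    if len(generated_views) == 0:
        raise ValueError("need at least one view")
    if len(generated_views) != len(rendered_views):
        raise ValueError("need exactly one rendering per generated view")
    scores = [dissimilarity(g, r) for g, r in zip(generated_views, rendered_views)]
    return float(np.mean(scores))


# Toy usage: random arrays stand in for images; no NeRF is involved here.
rng = np.random.default_rng(0)
views = [rng.random((8, 8, 3)) for _ in range(4)]
consistent = mrc(views, views)

# Simulate a rendering that disagrees with the 4th view in one patch.
renderings = [v.copy() for v in views]
renderings[3][:4, :4] += 0.5
inconsistent = mrc(views, renderings)
```

In this toy setup the score is zero when renderings match the generated views and grows once a patch of one view disagrees, which is the behavior a consistency metric of this kind is meant to have.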

Paper Structure

This paper contains 50 sections, 7 equations, 17 figures, and 1 table.

Figures (17)

  • Figure 1: Our Carve3D algorithm steadily improves the 3D consistency of a multi-view diffusion model and the resulting quality of the NeRF and the mesh, without sacrificing its image-prompt alignment, texture details, or realism. Here, we show 3 testing-set results (in 3 rows, numbered 1-3 and separated by dotted lines) from the finetuning process (epochs 0, 28, and 55 in 3 columns). Each row includes the generated multi-view images (denoted as MV), the reconstructed NeRF and extracted mesh (denoted as RM), and the text prompt (denoted as TP). The inconsistencies in the multi-view images, e.g. the facing direction of the shopping cart, the position of the octopus arms, and the position of the pencils, lead to artifacts in the NeRF and the mesh (highlighted in red).
  • Figure 2: Overview of Carve3D. Given a prompt sampled from our curated prompt set and an initial noisy image, we iteratively denoise the image using the UNet. The final, clean image contains four multi-view images tiled in a 2-by-2 grid. The MRC reward is computed by comparing (a) the generated multi-view images with (c) the corresponding multi-view images rendered at the same camera viewpoints from (b) the reconstructed NeRF. Then, we train the model with a policy gradient loss function, where the loss is derived from the reward and the log probabilities of the UNet's predictions, accumulated over all denoising timesteps. By using only a set of training text prompts, our RLFT algorithm finetunes the diffusion model by evaluating its own generated outputs, without relying on ground truth multi-view images.
  • Figure 3: Qualitative correlation between MRC and multi-view inconsistency of increasing intensity, introduced by inpainting with increasing mask sizes. Left: the four ground truth (GT) views. Right: the 4th view is inpainted with increasing area sizes, i.e. 0$\times$0, 50$\times$50, and 110$\times$110 pixels. The top row is the image after inpainting, and the bottom row is the image rendered from the NeRF reconstructed with the inpainted 4th view above it and the other 3 original GT views. We mark the inpainting area with blue and red boxes. Since the lion's right paw in the inpainted 4th views looks different from the other three original views, its shape is broken in the NeRF and the rendered views. This difference is captured by MRC's image dissimilarity metric.
  • Figure 4: Quantitative correlation between MRC and multi-view inconsistency of increasing intensity, for the object shown in Figure 3. As the inconsistency intensity rises, MRC increases monotonically.
  • Figure 5: Comparing the IS and the SF versions of Carve3D reward curves on the testing set over 4 different random seeds. The IS version produces reward curves with high variance, including two runs that fail. In contrast, all runs of the SF version stably produce reward curves with low variance.
  • ...and 12 more figures
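
The pipeline described in the Figure 2 caption can be sketched in two small pieces: splitting the tiled 2-by-2 output into four views, and forming a policy gradient loss from a reward and the per-timestep log probabilities. This is an illustrative sketch rather than the authors' implementation. The names `split_grid` and `policy_gradient_loss` are hypothetical, the row-major view order is an assumption, taking the reward as the negative MRC is an assumption (Figure 4 indicates that MRC rises with inconsistency), and the per-batch reward normalization is a common variance-reduction choice that the captions do not confirm. NumPy computes only the loss value here; in a training loop the log probabilities would be differentiable tensors.

```python
import numpy as np


def split_grid(image):
    """Split an (H, W, C) image tiled as a 2-by-2 grid into four views.

    Views are returned in row-major order (an assumption):
    top-left, top-right, bottom-left, bottom-right.
    """
    h, w = image.shape[0], image.shape[1]
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    hh, hw = h // 2, w // 2
    return [image[:hh, :hw], image[:hh, hw:], image[hh:, :hw], image[hh:, hw:]]


def policy_gradient_loss(log_probs, rewards):
    """Score-function (REINFORCE-style) surrogate loss.

    log_probs: (batch, timesteps) log probabilities of each denoising step.
    rewards:   (batch,) one scalar reward per generated sample.
    """
    log_probs = np.asarray(log_probs, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    # Normalize rewards within the batch (assumed, not from the source).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Accumulate log probabilities over all denoising timesteps.
    per_sample = advantages * log_probs.sum(axis=1)
    # Minimizing this raises the likelihood of above-average samples.
    return float(-per_sample.mean())


# Toy usage with made-up numbers.
grid = np.arange(4 * 4 * 3).reshape(4, 4, 3)
views = split_grid(grid)

mrc_scores = np.array([0.10, 0.30])   # lower means more consistent
rewards = -mrc_scores                 # assumed reward definition
log_probs = np.array([[-1.0, -2.0], [-0.5, -0.5]])
loss = policy_gradient_loss(log_probs, rewards)
```

Because the loss depends only on the model's own samples and their rewards, a sketch like this needs text prompts but no ground truth multi-view images, which matches the caption's description.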