Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

TL;DR

Bootstrap3D is a framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Extensive experiments demonstrate that it can generate high-quality multi-view images with superior aesthetic quality and image-text alignment while maintaining view consistency.

Abstract

Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and maintained view consistency.

Paper Structure

This paper contains 31 sections, 2 equations, 31 figures, and 11 tables.

Figures (31)

  • Figure 1: Bootstrap3D can generate high-quality multi-view images with precise long-text control and style customization while maintaining view consistency.
  • Figure 2: The Bootstrap3D data generation pipeline consists of 1) using an LLM to generate diverse text prompts, 2) employing a T2I model to generate single-view images, 3) synthesizing an arbitrary number of multi-view images with a video diffusion model, and 4) employing MV-LLaVA to filter for high-quality data and rewrite captions to be dense and descriptive.
  • Figure 3: MV-LLaVA. We use GPT-4V [2023GPT4VisionSC] to generate long descriptive captions, quality scores, and reasoning processes for multi-view images to construct the instruction tuning dataset. We then fine-tune MV-LLaVA, based on LLaVA [liu2024visual], to serve as a human-aligned quality checker and captioner for multi-view images.
  • Figure 4: Training Timestep Reschedule (TTR). For each type of training data, we restrict the training timestep $t$ accordingly to balance three goals: varied, highly aesthetic images that are better aligned with the text prompt, photo-realistic texture, and view consistency for 3D generation.
  • Figure 5: 3D objects generated by Bootstrap3D compared with other cutting-edge methods, given a text prompt. More results at higher resolution are available in the supplementary material.
  • ...and 26 more figures
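
The Training Timestep Reschedule idea in the Figure 4 caption (restricting the training timestep $t$ depending on the type of training data) can be illustrated with a minimal sketch. This is not the authors' implementation: the data-type names, the total of 1000 timesteps, and the per-type ranges below are hypothetical placeholders chosen only to show the mechanism, and the actual ranges would have to come from the paper.

```python
import random

# Assumed total number of diffusion training timesteps (illustrative).
NUM_TRAIN_TIMESTEPS = 1000

# HYPOTHETICAL mapping from data type to the allowed timestep interval
# [low, high). These names and numbers are placeholders, not values
# taken from the Bootstrap3D paper.
TIMESTEP_RANGES = {
    "rendered_3d_asset": (0, NUM_TRAIN_TIMESTEPS),
    "synthetic_multi_view": (200, NUM_TRAIN_TIMESTEPS),
}


def sample_timestep(data_type, rng):
    """Draw one training timestep, restricted by the sample's data type."""
    low, high = TIMESTEP_RANGES[data_type]  # KeyError for unknown types
    return rng.randrange(low, high)


def sample_batch_timesteps(data_types, seed=None):
    """Draw one restricted timestep per sample in a mixed-data batch."""
    rng = random.Random(seed)
    return [sample_timestep(d, rng) for d in data_types]


if __name__ == "__main__":
    batch = ["rendered_3d_asset", "synthetic_multi_view", "synthetic_multi_view"]
    print(sample_batch_timesteps(batch, seed=0))
```

In a training loop, the sampled timesteps would replace the usual uniform draw over all timesteps, so that each data source only contributes to the part of the denoising process it is assigned to.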