Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation

Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang

TL;DR

This work addresses the limitations of single-view prompts in image-to-3D generation by introducing MultiImageDream, which extends the state-of-the-art ImageDream to accept multiple image prompts via multi-view diffusion. The method adds multi-image local and pixel controllers that incorporate $N$ image prompts through cross-attention and stacked pixel latents, without requiring fine-tuning. Empirical results show improved diffusion quality and competitive 3D rendering performance for several multi-image configurations, with qualitative evidence of reduced whitening and better texture across viewpoints. The approach highlights the potential of multi-view image guidance for robust 3D generation and points to future work on fine-tuning with multi-image prompts and ensuring cross-view 3D consistency.
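
The two controller ideas named above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes that the $N$ prompts reach cross-attention by concatenating their feature tokens along the sequence axis and that prompt latents are appended to the generated-view latents along the view axis. The summary states only that cross-attention and stacked pixel latents are used, so both details are illustrative guesses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all key/value tokens."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def multi_image_cross_attention(queries, prompt_tokens):
    """Attend over tokens from N image prompts.

    prompt_tokens is a list of N token lists, one per prompt image.
    ASSUMPTION: the N token lists are concatenated into one context sequence.
    """
    context = [tok for tokens in prompt_tokens for tok in tokens]
    return cross_attention(queries, context, context)


def stack_pixel_latents(view_latents, prompt_latents):
    """Append N prompt latents to the generated-view latents.

    ASSUMPTION: stacking happens along the view (frame) axis.
    """
    return list(view_latents) + list(prompt_latents)


# Two prompt images, one feature token each; one query token.
attended = multi_image_cross_attention(
    queries=[[1.0, 0.0]],
    prompt_tokens=[[[1.0, 0.0]], [[0.0, 1.0]]],
)

# Placeholder strings stand in for latent tensors.
stacked = stack_pixel_latents(["v0", "v1", "v2", "v3"], ["p0", "p1"])
```

Because attention weights are computed per call, the number of context tokens can change at inference time without altering any learned weights, which is consistent with the claim that no fine-tuning is required.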

Abstract

Using images as prompts for 3D generation yields particularly strong performance compared to using text prompts alone, since images provide more intuitive guidance for the 3D generation process. In this work, we explore the potential of using multiple image prompts, instead of a single image prompt, for 3D generation. Specifically, we build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt. Our method, dubbed MultiImageDream, shows that moving from a single-image prompt to multiple-image prompts improves multi-view and 3D object generation according to various quantitative evaluation metrics and qualitative assessments. This improvement is achieved without fine-tuning the pre-trained ImageDream multi-view diffusion model.

Paper Structure

This paper contains 8 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Overview of MultiImageDream. We extend the Local and Pixel controllers proposed in ImageDream [wang2023imagedream] to support multi-image prompts for improved 3D generation.
  • Figure 2: Qualitative results of MultiImageDream in comparison to ImageDream [wang2023imagedream]. Note that the multi-view diffusion outputs of ImageDream are used as the additional image prompts for MultiImageDream. Using multiple image prompts gives better diffusion and 3D generation outputs, alleviating issues such as excess whitening or lack of detail in the back view. Best viewed on a screen; zoom in for detail.