Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation
Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang
TL;DR
This work addresses the limitations of single-view prompts in image-to-3D generation by introducing MultiImageDream, which extends the state-of-the-art ImageDream to accept multiple image prompts via multi-view diffusion. The method adds multi-image local and pixel controllers that incorporate $N$ prompts through cross-attention and stacked pixel latents, without requiring fine-tuning of the pre-trained model. Empirical results show improved diffusion quality and competitive 3D rendering performance for several multi-image configurations, with qualitative evidence of reduced whitening and better texture across viewpoints. The approach highlights the potential of multi-view image guidance for robust 3D generation and points to future work on fine-tuning with multi-image prompts and ensuring cross-view 3D consistency.
Abstract
Using images as prompts for 3D generation yields particularly strong performance compared to using text prompts alone, since images provide more intuitive guidance for the 3D generation process. In this work, we explore the potential of using multiple image prompts, instead of a single image prompt, for 3D generation. Specifically, we build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt. Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation according to various quantitative evaluation metrics and qualitative assessments. This advancement is achieved without the need to fine-tune the pre-trained ImageDream multi-view diffusion model.
