SPAD : Spatially Aware Multiview Diffusers
Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin
TL;DR
Problem: generating multi-view images that are 3D-consistent from a text prompt or a single image using diffusion models.
Approach: SPAD extends a pretrained 2D diffusion model with cross-view Epipolar Attention and Plücker Ray Embeddings and fine-tunes it on a filtered Objaverse subset, which enables arbitrary camera control and 3D consistency. It supports text- and image-conditioned generation, as well as text-to-3D generation via SDS and fast NeRF/Triplane pipelines.
Contributions: a geometry-constrained multi-view diffusion framework; improved 3D consistency and image quality on unseen Objaverse objects and on Google Scanned Objects; ablations validating the Epipolar Attention and Plücker components.
Significance: a scalable route to creating 3D assets from natural language or a single image, with practical uses in rapid 3D content generation and editing.
Abstract
We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high-quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g., MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we use Plücker coordinates derived from camera rays and inject them as positional encodings. This enables SPAD to reason effectively about spatial proximity in 3D. In contrast to recent works that can only generate views at fixed azimuth and elevation, SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue. See more details at our webpage: https://yashkant.github.io/spad
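
To make the Plücker positional encoding concrete, the following is a minimal sketch of how per-pixel Plücker coordinates can be computed from camera rays. It is not the authors' implementation: the function name `plucker_ray_embedding`, the pinhole intrinsics matrix `K`, the 4x4 `cam_to_world` pose, and the ordering (direction, moment) are all assumptions made for illustration. The abstract does not specify how the coordinates are injected into the attention layers, so that step is omitted.

```python
import numpy as np

def plucker_ray_embedding(K, cam_to_world, height, width):
    """Hypothetical helper: (H, W, 6) Plücker coordinates (d, o x d) per pixel.

    K            : (3, 3) pinhole intrinsics (assumed convention).
    cam_to_world : (4, 4) camera-to-world pose (assumed convention).
    """
    # Pixel centers in image coordinates; u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to ray directions in the camera frame.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into the world frame and normalize to unit length.
    rotation = cam_to_world[:3, :3]
    origin = cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # The moment o x d is the same for every point chosen on the ray,
    # so (d, o x d) identifies the ray itself rather than a point on it.
    moment = np.cross(np.broadcast_to(origin, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)

# Example camera: 64x64 image, focal length 100 px, translated from the origin.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
embedding = plucker_ray_embedding(K, pose, height=64, width=64)
```

Because the moment does not depend on which point along the ray is used as the origin, two pixels in different views whose rays pass close to each other in 3D receive similar encodings, which is one plausible reading of how such an encoding helps a model reason about spatial proximity.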
