SPAD: Spatially Aware Multiview Diffusers

Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin

TL;DR

Problem: achieving robust, multi-view, 3D-consistent image generation from text or a single image using diffusion models. Approach: SPAD extends a pretrained 2D diffusion model with cross-view Epipolar Attention and Plücker Ray Embeddings, trained on a filtered Objaverse subset, to enable arbitrary camera control and 3D consistency; supports text- and image-conditioned generation and 3D-aware text-to-3D via SDS and fast NeRF/Triplane pipelines. Contributions: a geometry-constrained multi-view diffusion framework, demonstrated improvements in 3D consistency and image quality on unseen objects and Google Scanned Objects, with ablations validating Epipolar and Plücker components. Significance: provides a scalable, high-quality pathway for 3D asset creation from natural language or a single image, with practical implications for rapid 3D content generation and editing.

Abstract

We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high-quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plücker coordinates derived from camera rays and inject them as positional encodings. This enables SPAD to reason effectively about spatial proximity in 3D. In contrast to recent works that can only generate views at fixed azimuth and elevation, SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue. See more details at our webpage: https://yashkant.github.io/spad
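
To make the Plücker positional encoding concrete, here is a minimal NumPy sketch of how a per-pixel ray map could be computed from a pinhole camera. It is an illustration under stated assumptions, not the authors' implementation: the function name `plucker_embedding`, the `(direction, moment)` channel order, the pixel-center convention, and the camera-to-world pose format are all hypothetical choices.

```python
import numpy as np


def plucker_embedding(K, cam_to_world, height, width):
    """Return an (H, W, 6) map holding (d, o x d) for every pixel ray.

    Assumptions (not from the paper): K is a 3x3 pinhole intrinsic matrix,
    cam_to_world is a 4x4 pose, and pixel centers sit at half-integer offsets.
    """
    # Homogeneous pixel coordinates (u, v, 1) at pixel centers.
    v, u = np.meshgrid(np.arange(height) + 0.5,
                       np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to camera-space directions, then rotate to world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    rotation, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ rotation.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # The moment o x d identifies the ray independently of the point chosen on it.
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)
```

Because the moment `o x d` is the same for any point `o` on the ray, two pixels in different views that look along the same 3D line receive the same six numbers up to sign, which is one way to see why such an encoding carries spatial proximity information.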

Paper Structure (23 sections, 7 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Consistent multi-view generation from text with SPAD. Given a text prompt, SPAD is capable of synthesizing many 3D consistent images of the same object, ranging from daily objects to highly complex machines. SPAD can generate many images from arbitrary camera viewpoints, while being trained only on four views. Here, we generate eight views sampled uniformly at a fixed elevation.
  • Figure 2: Model pipeline. (a) We initialize our multi-view diffusion model from a pre-trained text-to-image model, and fine-tune it on multi-view renders of 3D objects. (b) Our model performs joint denoising on noisy multi-view images $\{\bm{x}^i_t\}_{i=1}^N$ conditioned on text $\bm{y}$ and relative camera poses $\Delta \bm{E}$. Here, we illustrate the pipeline using $N=2$, which can be easily extended to more views. To enable cross-view interaction, we apply 3D self-attention by concatenating all views, and enforce epipolar constraints on the attention map. We further add (c) Plücker Embedding $\{\bm{P}^i\}_{i=1}^N$ to the attention layers as positional encodings, to enable precise camera control and prevent object flipping artefacts (as shown in the qualitative ablation, Figure 5).
  • Figure 3: Epipolar Attention. For each point $s$ (red point) on a feature map $\bm{F}^i$, we compute its epipolar lines $\{l^{j}\}_{j \ne i}$ on all other views $\{\bm{F}^j\}_{j \ne i}$. Point $s$ attends only to features along these lines, plus all points in its own view (blue points).
  • Figure 4: Illustration of one block in our multi-view diffusion model, which consists of a residual block, a self-attention layer, and a cross-attention layer. The residual block guides the model on the denoising timestep $t$ and the relative camera pose $\Delta \bm{E}$, while the cross-attention layer conditions on text $\bm{y}$. We add Plücker Embedding $\bm{P}$ to feature maps $\bm{F}$ in the self-attention layer by inflating the original QKV projection layers with zero projections.
  • Figure 5: Qualitative comparison between SPAD and its variants. We prompt models trained on two views to generate four views at 90 degree intervals for clear visual distinctions. The flipped predicted views are highlighted with red circles, while the content-copying issues are indicated by blue circles.
  • ...and 8 more figures
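
The epipolar constraint described in the Figure 3 caption can be sketched as a boolean attention mask between two views. This is a hedged illustration only: the function names, the shared intrinsics `K`, the relative pose convention `x_j = R x_i + t`, and the pixel-distance `threshold` are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def pixel_grid(height, width):
    """Homogeneous pixel centers, flattened row-major to shape (H*W, 3)."""
    v, u = np.meshgrid(np.arange(height) + 0.5,
                       np.arange(width) + 0.5, indexing="ij")
    return np.stack([u.ravel(), v.ravel(), np.ones(height * width)], axis=-1)


def epipolar_mask(K, R, t, height, width, threshold=1.0):
    """Mask of shape (H*W, H*W): entry [s, q] is True when pixel q of view j
    lies within `threshold` pixels of the epipolar line of pixel s of view i.

    Assumes both views share intrinsics K and that camera-i coordinates map
    to camera-j coordinates via x_j = R @ x_i + t (hypothetical convention).
    """
    K_inv = np.linalg.inv(K)
    fundamental = K_inv.T @ skew(t) @ R @ K_inv
    pixels = pixel_grid(height, width)

    # Row s holds the line coefficients (a, b, c) of F @ p_s in view j.
    lines = pixels @ fundamental.T
    # Point-to-line distance |a*u + b*v + c| / sqrt(a^2 + b^2).
    numerator = np.abs(lines @ pixels.T)
    denominator = np.linalg.norm(lines[:, :2], axis=-1, keepdims=True)
    distance = numerator / np.maximum(denominator, 1e-8)
    return distance <= threshold
```

In a joint attention over both views, such a mask would fill the off-diagonal (cross-view) blocks, while the diagonal blocks stay fully True so that each point still attends to every point in its own view, as the caption states. For a pure sideways translation the epipolar lines are horizontal, so each pixel attends only to the matching row of the other view.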