Multi-view Image Diffusion via Coordinate Noise and Fourier Attention

Justin Theiss, Norman Müller, Daeil Kim, Aayush Prakash

TL;DR

This work tackles the challenge of generating multi-view-consistent images from text prompts by introducing coordinate-based noise initialization, Fourier-based attention (FBA), and a prompt-driven cross-attention loss. The coordinate noise injects low-frequency, pose-aware information across views, while FBA focuses attention on non-overlapping regions in the Fourier domain, enabling coherent global appearance. The prompt cross-attention loss further aligns cross-view attention maps with ground-truth scene attention, yielding improved multi-view consistency. Quantitative and qualitative results on panoramic and depth-conditioned tasks demonstrate state-of-the-art performance and robust cross-view coherence, with potential implications for panoramic stills and temporally-consistent video synthesis conditioned on depth.

Abstract

Recently, text-to-image generation with diffusion models has made significant advancements in both fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism, as well as a novel noise initialization technique and a cross-attention loss. The Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves on the state of the art on several quantitative metrics, with qualitatively better results than other state-of-the-art approaches for multi-view consistency.
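The idea of inducing noise correlations across views with a shared noise component can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `correlated_view_noise`, the `shared_weight` parameter, and the square-root mixing rule are assumptions made for illustration, and the sketch omits the coordinate and depth terms as well as any view geometry.

```python
import numpy as np

def correlated_view_noise(num_views, shape, shared_weight=0.5, seed=0):
    """Mix one shared Gaussian sample with independent per-view samples.

    Hypothetical helper: `shared_weight` in [0, 1] sets how strongly the
    views are correlated (0 = fully independent, 1 = identical noise).
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(shape)                      # one sample reused by every view
    independent = rng.standard_normal((num_views, *shape))   # a fresh sample per view
    # Square-root weights keep each view's noise at unit variance,
    # which is what a diffusion sampler expects at initialization.
    return np.sqrt(shared_weight) * shared + np.sqrt(1.0 - shared_weight) * independent

# Four views of a 4-channel 32x32 latent (sizes chosen arbitrarily).
noise = correlated_view_noise(num_views=4, shape=(4, 32, 32), shared_weight=0.5)
corr = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
print(noise.shape, round(float(noise.std()), 2), round(float(corr), 2))
```

Under this mixing rule, the expected correlation between any two views equals `shared_weight`, so the parameter gives a direct handle on how much low-level structure the views share before denoising begins.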

Paper Structure

This paper contains 29 sections, 16 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: We propose a method that addresses the lack of consistency in multi-view image generation by aligning appearance in non-overlapping regions of multi-view scenes (left). Compared to MVDiffusion, our approach improves the consistency of textures and geometry, particularly in non-overlapping regions (e.g., floor and walls; right).
  • Figure 2: Overview of our proposed technique. (Left) We initialize noise by sampling Gaussian noise that is shared across views as well as noise that is independent per view. We combine the shared noise with depth or transformed pixel coordinates to obtain Coordinate Noise (Sec. \ref{sec:noise_init}), which provides a low spatial frequency bias to inform the overall structure of the scene. (Center) We add our Fourier-based Attention (FBA) blocks (blue, see right panel) within the U-Net architecture and introduce a cross-attention loss (Sec. \ref{sec:xa_loss}) to ensure consistent spatial relationships across views. (Right) Finally, our novel attention module (Sec. \ref{sec:fba}) attends to time-dependent spatial frequencies of features generated from the Coordinate Noise in non-overlapping regions to better align the global appearance across the scene.
  • Figure 3: Qualitative comparison for panoramic image generation. Colored boxes highlight misalignment with the prompt "a house with a pool in the backyard". See Section \ref{sec:qual_eval} for further detail.
  • Figure 4: Qualitative comparison for multi-view depth-to-image generation. Colored boxes highlight inconsistencies in baselines relative to our method. Blue, red, and white boxes show how small objects, large objects, and the environment (i.e., non-overlapping regions), respectively, change appearance in our baselines. We qualitatively outperform our baselines. See Section \ref{sec:qual_eval} for further detail.
  • Figure 5: Color/texture inconsistencies using CAA (MVDiffusion) vs. FBA (ours) blocks.
  • ...and 6 more figures