
Pixel-Aligned Multi-View Generation with Depth Guided Decoder

Zhenggang Tang, Peiye Zhuang, Chaoyang Wang, Aliaksandr Siarohin, Yash Kant, Alexander Schwing, Sergey Tulyakov, Hsin-Ying Lee

TL;DR

This paper tackles pixel-level misalignment in image-to-multi-view diffusion by introducing depth-guided cross-view decoding. It integrates depth-truncated epipolar attention into the VAE decoder and uses structured-noise depth augmentation during training to bridge depth inaccuracies at inference, where NeuS provides coarse depth. The approach yields improved multi-view consistency and cross-view correspondences, and enhances downstream 3D reconstruction quality, as demonstrated on Google Scanned Objects and Objaverse-derived data. The method maintains compatibility with existing latent multi-view diffusion frameworks and highlights practical gains for 3D asset generation, while outlining remaining challenges in unseen view texturing and future decoder enhancements.

Abstract

The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models, which consist of a VAE image encoder and a U-Net diffusion model, to a multi-view version. Specifically, these generation methods usually keep the VAE fixed and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, combined with independent decoding of each view, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to improve generalization to inaccurate depth when ground-truth depth is missing, we perturb depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks.
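
To make the idea of perturbing depth inputs during training concrete, the following is a minimal sketch only. The abstract does not specify the noise model, so the block-wise (spatially correlated) Gaussian noise used here, the function name `perturb_depth`, and the parameters `block` and `sigma` are all illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def perturb_depth(depth, block=8, sigma=0.05, rng=None):
    """Add spatially correlated noise to a depth map (hypothetical augmentation).

    Noise is drawn once per `block` x `block` patch, so neighbouring pixels
    share the same error, loosely imitating the smooth errors of coarse depth.
    Pixels with depth <= 0 are treated as background and left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = depth.shape
    # Number of noise cells needed to cover the image (ceil division).
    gh, gw = -(-h // block), -(-w // block)
    coarse_noise = rng.normal(0.0, sigma, size=(gh, gw))
    # Repeat each noise value over its block, then crop to the image size.
    noise = np.kron(coarse_noise, np.ones((block, block)))[:h, :w]
    # Keep perturbed depth strictly positive on foreground pixels.
    perturbed = np.maximum(depth + noise, 1e-6)
    return np.where(depth > 0, perturbed, depth)
```

A training loop under this assumption would call `perturb_depth` on each ground-truth depth map before it is passed to the decoder, so that the decoder does not rely on exact depth at inference time.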

Paper Structure (15 sections, 3 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Visualization of our method. Compared to the baseline methods (columns 4-7, 10-11), our proposed method generates pixel-aligned multi-view images, which can lead to better 3D reconstruction quality.
  • Figure 2: Overview. (top) We aim to achieve pixel-aligned multi-view image generation from the multi-view latents, either encoded or generated by a multi-view diffusion model. For this, we focus on improving the decoder. (bottom-left) The decoder contains four Up-blocks to upsample the resolution from $32$ to $256$. (bottom-right) We propose several additions, highlighted in blue. We add a condition from the input front-view image, and a depth-truncated epipolar attention mechanism. Note that the $4^\text{th}$ Up-block does not have an upsampling layer, as the resolution is not changed.
  • Figure 3: Epipolar Attention. (a) Full epipolar attention aggregates information along the whole epipolar line, covering unnecessary ranges (only the red dot is the correct position), which limits applicability to lower-resolution representations due to memory constraints. (b) Depth-truncated epipolar attention samples only points near the 3D location of that pixel (the red dot). It enables epipolar attention on higher-resolution representations and improves information aggregation.
  • Figure 4: Qualitative comparisons with baselines.
  • Figure 5: Qualitative comparisons after 3D rendering. To better understand the impact of pixel-level aligned multi-view images in the 3D generation pipeline, we reconstruct the 3D object using generated multi-view images. We can clearly observe that inconsistent multi-view images lead to reconstructed 3D objects which are blurry.
  • ...and 1 more figure
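
The geometry behind the depth-truncated sampling in Figure 3(b) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes pinhole cameras with shared intrinsics `K`, a known relative pose (`R`, `t`) mapping source-camera coordinates to target-camera coordinates, and a symmetric depth window `delta` with `n` uniformly spaced samples. The function name and all parameters are hypothetical.

```python
import numpy as np

def truncated_epipolar_samples(uv, depth, K, R, t, delta=0.1, n=5):
    """Project depth hypotheses near `depth` for pixel `uv` into a second view.

    uv:    (u, v) pixel coordinates in the source view
    depth: coarse depth estimate at that pixel
    K:     (3, 3) pinhole intrinsics, assumed shared by both views
    R, t:  rotation (3, 3) and translation (3,) from source to target camera
    Returns an (n, 2) array of target-view pixel coordinates. All of them lie
    on the epipolar line of `uv`, but only within the truncated depth range.
    """
    u, v = uv
    # Back-project the pixel to a ray with unit z in the source camera frame.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Depth hypotheses restricted to a window around the coarse depth.
    depths = np.linspace(depth - delta, depth + delta, n)
    pts_src = depths[:, None] * ray[None, :]   # (n, 3) points in source frame
    pts_tgt = pts_src @ R.T + t                # (n, 3) points in target frame
    proj = pts_tgt @ K.T                       # homogeneous pixel coordinates
    return proj[:, :2] / proj[:, 2:3]
```

In an attention layer, the returned coordinates would typically be used to gather features from the other view as keys and values, so each query attends to `n` candidates instead of every position along the full epipolar line. How the features are gathered and how many samples are used are left open here.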