Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo

TL;DR

Era3D tackles camera-prior distortions in single-view 3D reconstruction by introducing a camera-prediction module and a canonical orthogonal-view generation pipeline. It pairs Elevation and Focal length Regression (EFReg) with a novel row-wise multiview attention to enable high-resolution diffusion across views with substantially reduced compute. The approach achieves state-of-the-art results on novel-view synthesis and 3D mesh reconstruction at 512×512 resolution, while maintaining efficiency. Practical impact includes more scalable, detailed 3D generation from a single image, though limitations in modeling very thin structures and open meshes are acknowledged alongside ethics considerations.

Abstract

In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficiency, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images comply with a predefined camera type, e.g., a perspective camera with a fixed focal length, which leads to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ causes computational cost to grow steeply as image resolution increases, resulting in prohibitively expensive training. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module that estimates the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images at resolutions up to 512×512 while reducing computational complexity by 12×. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods. Project page: https://penghtyx.github.io/Era3D/.

Paper Structure (15 sections, 1 theorem, 8 equations, 12 figures, 8 tables)


Key Result

Proposition 1

If two orthogonal cameras look at the origin with their $y$ axes aligned with the gravity direction and elevations of $0^\circ$, as shown in Fig. 3(d), then for a pixel with coordinates $(x,y) = (u, v)$ in one camera, the corresponding epipolar line in the other view is $y=v$.
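Proposition 1 can be checked numerically with a minimal sketch. It assumes an idealized orthographic projection with unit scale, and the function name `ortho_project` and the azimuth parameterization are illustrative choices rather than anything taken from the paper. All 3D points that one camera sees at pixel $(u, v)$ are projected into cameras at other azimuths, and their image $y$ coordinate is compared with $v$.

```python
import math


def ortho_project(point, azimuth_deg):
    """Orthographic projection into a camera at elevation 0 (illustrative helper).

    The camera looks at the origin, its image y axis is the world y axis
    (gravity direction), and it is rotated about that axis by azimuth_deg.
    """
    X, Y, Z = point
    a = math.radians(azimuth_deg)
    # At elevation 0 the camera's right vector lies in the horizontal xz plane.
    x_img = math.cos(a) * X - math.sin(a) * Z
    # A rotation about the y axis leaves the height of the point unchanged.
    y_img = Y
    return x_img, y_img


# The azimuth-0 camera views along z, so pixel (u, v) sees every point (u, v, t).
u, v = 0.3, -0.2
ray = [(u, v, t) for t in (-1.0, 0.0, 0.5, 2.0)]

for azimuth in (45, 90, 180):
    projected = [ortho_project(p, azimuth) for p in ray]
    xs = [round(x, 3) for x, _ in projected]
    ys = {y for _, y in projected}
    print(azimuth, xs, ys)
```

In every other view the points spread out along $x$ while $y$ stays equal to $v$, so the epipolar line is the image row $y = v$, as the proposition states.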

Figures (12)

  • Figure 1: Given a single-view image with arbitrary intrinsics and viewpoint, Era3D can generate high-quality multiview images at a resolution of $512\times512$ in the orthogonal camera setting, which can be used for mesh reconstruction by NeuS [wang2021neus].
  • Figure 2: (top) Perspective input images cause extreme distortion in the images generated by Wonder3D. (bottom) Era3D can handle images captured with commonly used intrinsics.
  • Figure 3: Different types of multiview attention layers. (a) In a dense multiview attention layer, all feature vectors of the multiview images are fed into one attention block. For a general camera setting (b) with arbitrary viewpoints and intrinsics, constructing an epipolar attention layer (c) requires correlating the features along each epipolar line, which means sampling $K$ points along every epipolar line. In our canonical camera setting (d), with orthogonal cameras and viewpoints at an elevation of $0^{\circ}$, epipolar lines align with the image rows across views (e), which eliminates the need to resample epipolar lines to compute epipolar attention. We assume the latent feature map has a resolution of $H\times W$ with $H=W=S$. In such an $N$-view camera system, row-wise attention reduces the computational complexity to $O(N^2S^3)$, compared with $O(N^2S^4)$ for dense attention over all $NS^2$ feature vectors.
  • Figure 4: Overview. Given a single-view image as input, Era3D applies multiview diffusion to generate multiview-consistent images and normal maps in the canonical camera setting, which enables us to reconstruct 3D meshes using NeuS [wang2021neus, muller2022instantnsr].
  • Figure 5: Qualitative comparison of 3D reconstruction results on the GSO dataset [downs2022gso]. Era3D produces the highest-quality 3D meshes, with more detail than baseline methods.
  • ...and 7 more figures
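
The row-wise attention described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses a single head, identity query/key/value projections, and nested lists instead of tensors, and the names `attend`, `row_wise_attention`, and `attention_pairs` are hypothetical. Each image row attends only to the same row in all views, and a small helper counts the query-key pairs scored by dense and row-wise attention.

```python
import math
import random


def attend(tokens):
    """Single-head self-attention over a list of d-dim tokens.

    Queries, keys and values are the tokens themselves (identity projections),
    a simplification made for this sketch.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * k[c] for w, k in zip(weights, tokens)) / total
                    for c in range(d)])
    return out


def row_wise_attention(feats):
    """feats[n][y][x] is the feature vector of view n at row y, column x."""
    n_views, height, width = len(feats), len(feats[0]), len(feats[0][0])
    out = [[[None] * width for _ in range(height)] for _ in range(n_views)]
    for y in range(height):
        # Tokens of row y from every view: n_views * width tokens in total.
        tokens = [feats[n][y][x] for n in range(n_views) for x in range(width)]
        for i, vec in enumerate(attend(tokens)):
            n, x = divmod(i, width)
            out[n][y][x] = vec
    return out


def attention_pairs(n_views, side):
    """Query-key pairs scored by dense vs. row-wise attention on side x side maps."""
    dense = (n_views * side * side) ** 2      # N^2 S^4
    row_wise = side * (n_views * side) ** 2   # S rows of N^2 S^2 pairs: N^2 S^3
    return dense, row_wise


random.seed(0)
feats = [[[[random.random() for _ in range(4)] for _ in range(3)]
          for _ in range(3)] for _ in range(2)]   # 2 views, 3x3 map, 4 channels
out = row_wise_attention(feats)

dense, row_wise = attention_pairs(n_views=6, side=64)  # hypothetical sizes
print(dense // row_wise)
```

Under this pair-counting model the ratio between dense and row-wise attention equals the side length $S$. It counts attention scores only, so it should not be read as a derivation of the measured 12× reduction reported in the abstract.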

Theorems & Definitions (1)

  • Proposition 1