EscherNet: A Generative Model for Scalable View Synthesis

Xin Kong; Shikun Liu; Xiaoyang Lyu; Marwan Taher; Xiaojuan Qi; Andrew J. Davison

TL;DR

This work introduces EscherNet, a multi-view conditioned diffusion model for view synthesis that can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views.

Abstract

We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis -- it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet.

Paper Structure (39 sections, 7 equations, 8 figures, 6 tables)

This paper contains 39 sections, 7 equations, 8 figures, 6 tables.

Introduction
Related Work
Neural 3D Representations
Novel View Synthesis
3D Diffusion Models
EscherNet
Problem Formulation and Notation
Architecture Design
Multi-View Generation
Conditioning Reference Views
Camera Positional Encoding (CaPE)
4 DoF CaPE
6 DoF CaPE
Experiments
Training Datasets
...and 24 more sections

Figures (8)

Figure 1: We introduce EscherNet, a diffusion model that can generate a flexible number of consistent target views (highlighted in blue) with arbitrary camera poses, based on a flexible number of reference views (highlighted in purple). EscherNet demonstrates remarkable precision in camera control and robust generalisation across synthetic and real-world images featuring multiple objects and rich textures.
Figure 2: 3D representations overview. EscherNet generates a set of $M$ target views ${\bf X}_{1:M}^T$ based on their camera poses ${\bf P}_{1:M}^T$, leveraging information gained from a set of $N$ reference views ${\bf X}_{1:N}^R$ and their camera poses ${\bf P}_{1:N}^R$. EscherNet presents a new way of learning implicit 3D representations by only considering the relative camera transformation between the camera poses of ${\bf P}^R$ and ${\bf P}^T$, making it easier to scale with multi-view posed images, independent of any specific coordinate systems.
Figure 3: EscherNet architecture details. EscherNet adopts the Stable Diffusion architectural design with minimal but important modifications. The lightweight vision encoder captures both high-level and low-level signals from $N$ reference views. In U-Net, we apply self-attention within $M$ target views to encourage target-to-target consistency, and cross-attention between $M$ target and $N$ reference views (encoded by the image encoder) to encourage reference-to-target consistency. In each attention block, CaPE is employed for the key and query, allowing the attention map to learn with relative camera poses, independent of specific coordinate systems.
Figure 4: Generated views visualisation on the NeRF Synthetic drum scene. EscherNet generates plausible view synthesis even when provided with very limited reference views, while neural rendering methods fail to generate any meaningful content. However, when we have more than 10 reference views, scene-specific methods exhibit a substantial improvement in rendering quality. We report the mean PSNR averaged across all test views from the drum scene. Results for other scenes and/or with more reference views are shown in Appendix \ref{app:nerf}.
Figure 5: Novel view synthesis visualisation on GSO and RTMV datasets. EscherNet outperforms Zero-1-to-3-XL, delivering superior generation quality and finer camera control. Notably, when conditioned with additional views, EscherNet exhibits an enhanced resemblance of the generated views to ground-truth textures, revealing more refined texture details such as in the backpack straps and turtle shell.
...and 3 more figures
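
The Figure 3 caption says that CaPE is applied to the key and query so that the attention map depends on relative camera poses, independent of any specific coordinate system. The sketch below illustrates one construction with that property; it is not taken from this page and should not be read as the authors' implementation. It assumes 4x4 rigid camera poses, a feature dimension divisible by 4, keys multiplied chunk-wise by the pose, and queries multiplied chunk-wise by the inverse-transposed pose. All function names (`random_pose`, `apply_pose`, `attention_logits`) are hypothetical.

```python
import numpy as np


def random_pose(rng):
    """Return a random rigid 4x4 pose (proper rotation plus translation)."""
    rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(rot) < 0:
        rot[:, 0] = -rot[:, 0]  # flip one axis so that det(rot) = +1
    pose = np.eye(4)
    pose[:3, :3] = rot
    pose[:3, 3] = rng.normal(size=3)
    return pose


def apply_pose(tokens, mat):
    """Multiply every consecutive 4-dim chunk of each token by `mat`.

    Equivalent to multiplying by a block-diagonal matrix that repeats
    `mat` d/4 times along the diagonal (an assumption of this sketch).
    """
    n, d = tokens.shape
    assert d % 4 == 0, "feature dimension must be divisible by 4"
    chunks = tokens.reshape(n, d // 4, 4)
    return np.einsum("ij,ncj->nci", mat, chunks).reshape(n, d)


def attention_logits(queries, keys, pose_q, pose_k):
    """Unnormalised attention logits with pose-encoded queries and keys.

    Keys are encoded with the pose, queries with its inverse transpose,
    so each chunk contributes q^T (pose_q^{-1} pose_k) k to the logit.
    """
    q_enc = apply_pose(queries, np.linalg.inv(pose_q).T)
    k_enc = apply_pose(keys, pose_k)
    return q_enc @ k_enc.T


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))   # 5 target-view tokens, feature dim 8
k = rng.normal(size=(7, 8))   # 7 reference-view tokens
pose_t, pose_r = random_pose(rng), random_pose(rng)
world_change = random_pose(rng)  # arbitrary change of world coordinates

logits = attention_logits(q, k, pose_t, pose_r)
logits_moved = attention_logits(q, k, world_change @ pose_t, world_change @ pose_r)
print(np.allclose(logits, logits_moved))
```

Because the logits reduce to a function of `pose_t^{-1} @ pose_r`, re-expressing both cameras in a different world frame leaves them unchanged, which is the coordinate-system independence the caption describes.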
