Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training

Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang

TL;DR

The paper tackles the challenge of maintaining 3D-consistent novel-view synthesis from a single image by introducing a training-free 3D epipolar attention mechanism. By locating and importing overlapping information from a reference view along epipolar lines and extending this to multi-view contexts, the approach enhances consistency without retraining a diffusion backbone. Key contributions include a parameter-duplicated epipolar attention block, DDIM-inversion–driven paired features, and a multi-view extension that aggregates information from multiple context views. Experiments on GSO and Objaverse show improved multi-view consistency and downstream 3D reconstruction quality, with a favorable trade-off between performance and memory compared to training-based baselines.

Abstract

Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.
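
To make the epipolar-geometry step in the abstract concrete, the following is a minimal NumPy sketch of how a pixel in one view maps to a line in another view given a relative camera pose. It is an illustration of standard two-view geometry, not the authors' code: the function names (`skew`, `fundamental_matrix`, `epipolar_line`, `point_line_distance`, `project`), the shared pinhole intrinsics `K`, and the pose convention `X2 = R @ X1 + t` are all assumptions made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def fundamental_matrix(K, R, t):
    """Fundamental matrix F with x2^T F x1 = 0.

    Assumed convention: a point X1 in camera-1 coordinates maps to
    X2 = R @ X1 + t in camera-2 coordinates; both views share intrinsics K.
    """
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_line(F, pixel):
    """Coefficients (a, b, c) of the line a*u + b*v + c = 0 in view 2
    on which the match of `pixel` (from view 1) must lie."""
    return F @ np.array([pixel[0], pixel[1], 1.0])

def point_line_distance(line, pixel):
    """Perpendicular distance in pixels from `pixel` to `line`."""
    a, b, c = line
    return abs(a * pixel[0] + b * pixel[1] + c) / np.hypot(a, b)

def project(K, X):
    """Pinhole projection of a 3D point X (camera coordinates) to pixels."""
    x = K @ X
    return x[:2] / x[2]
```

A quick sanity check is to project one 3D point into both views and confirm that its second-view pixel sits on the epipolar line of its first-view pixel; pixels far from that line cannot be correspondences, which is what allows the search for overlapping content to be restricted to the line.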

Paper Structure

This paper contains 28 sections, 8 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Given an input image and a sequence of relative camera pose transformations, our method synthesizes more consistent novel view images. Our method does not need to re-train the baseline model (Zero123) and supports arbitrary relative camera poses.
  • Figure 2: When the camera viewing frustum of two views overlaps, for a point on one of the images, we can find its correspondence on the epipolar line of the other view.
  • Figure 3: Overview of our method. (a) We first perform DDIM inversion on the input image to obtain the initial noise, which is shared during the multi-view image generation process. Throughout the generation of each view, our epipolar attention block efficiently locates and retrieves corresponding information from both the input image and other target views. (b) The architecture of our 3D epipolar attention module. (c) Location of our inserted epipolar attention block.
  • Figure 4: Comparison between our epipolar attention and the full attention. Our epipolar attention better locates and retrieves the corresponding information in the reference view.
  • Figure 5: Qualitative comparison with the baseline for generating a sequence of novel view images. The results demonstrate that our method synthesizes more consistent multi-view images compared to our baseline model (Zero123). In addition, compared to SyncDreamer, our method visually maintains better similarity to the conditioned image and appears more natural.
  • ...and 10 more figures
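
The contrast drawn in the Figure 2 and Figure 4 captions, attention restricted to the epipolar line versus full attention, can be sketched as a masked attention over flattened feature maps. This is a hedged illustration only: the helper names (`pixel_grid`, `epipolar_mask`, `masked_attention`), the pixel-distance `threshold`, the hard binary mask, and the fallback to full attention for rows with no valid key are assumptions of this example, and the paper's actual module may weight or combine features differently.

```python
import numpy as np

def pixel_grid(h, w):
    """Homogeneous pixel centers of an h x w feature map, shape (h*w, 3),
    ordered row by row (index = row * w + column)."""
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([u.ravel() + 0.5, v.ravel() + 0.5, np.ones(h * w)], axis=1)

def epipolar_mask(F, h, w, threshold=1.0):
    """Boolean (h*w, h*w) mask: entry [i, j] is True when reference pixel j
    lies within `threshold` pixels of the epipolar line of target pixel i.
    F is assumed to map target pixels to lines in the reference view."""
    pts = pixel_grid(h, w)
    lines = pts @ F.T                                   # one line per target pixel
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = np.abs(lines @ pts.T) / np.maximum(norm, 1e-12)
    return dist <= threshold

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where each query only attends to keys
    allowed by `mask`. Returns (output, attention weights)."""
    full = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, full, -np.inf)
    empty = ~mask.any(axis=1)            # queries whose line misses the image
    scores[empty] = full[empty]          # assumption: fall back to full attention
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

With a purely horizontal camera shift and identity intrinsics, the epipolar lines are image rows, so each query in this sketch attends only to reference features in its own row; passing an all-True mask instead recovers ordinary full attention for comparison.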