ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

Xianghui Yang; Yan Zuo; Sameera Ramasinghe; Loris Bazzani; Gil Avraham; Anton van den Hengel

TL;DR

ViewFusion addresses multi-view inconsistency in diffusion-based novel-view synthesis by introducing a training-free auto-regressive framework that leverages previously generated views via Interpolated Denoising. It extends pre-trained single-view diffusion models to multi-view conditioning through a Noise Interpolation Module and a principled weighting scheme that emphasizes near views while maintaining information from earlier conditions. Empirical results on ABO and GSO show improved multi-view consistency and 3D reconstruction quality with competitive image metrics, without finetuning or architectural changes to the base diffusion models. This approach enables scalable, plug-in enhancement of existing diffusion pipelines for robust multi-view synthesis and downstream 3D tasks.

Abstract

Novel-view synthesis through diffusion models has demonstrated remarkable potential for generating diverse and high-quality images. Yet, the independent process of image generation in these prevailing methods leads to challenges in maintaining multiple-view consistency. To address this, we introduce ViewFusion, a novel, training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models. Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation, ensuring robust multi-view consistency during the novel-view generation process. Through a diffusion process that fuses known-view information via interpolated denoising, our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning. Extensive experimental results demonstrate the effectiveness of ViewFusion in generating consistent and detailed novel views.

Paper Structure (35 sections, 16 equations, 13 figures, 6 tables, 1 algorithm)

Introduction
Related Work
3D-adapted Diffusion Models
Novel View Synthesis Diffusion Models
Other Single-view Reconstruction Methods
Method
Denoising Diffusion Probabilistic Models
Pose-Conditional Diffusion Models
Direct condition
Stochastic conditioning
Joint output distribution
Auto-regressive distribution
Interpolated Denoising
Single and Multi-view Denoising
Step-by-step Generation
...and 20 more sections

Figures (13)

Figure 1: The cause of multi-view inconsistency in diffusion-based novel-view synthesis models. (a) Diffusion models incorporate randomness for diversity and better distribution modeling; this independent generation process produces realistic views under specific instances but may produce different plausible views for various instances, lacking alignment across adjacent views. (b) In contrast, ViewFusion incorporates an auto-regressive process to reduce uncertainty and achieve multi-view consistency, by ensuring a correlated denoising process that ends at the same high-density area, fostering consistency across views.
Figure 2: Illustration of the Auto-Regressive Generation Process. In our approach, we extend a pre-trained diffusion model from single-stage to multi-stage generation, and we maintain a view set that contains all generated views. For each stage, we construct $N$ reverse diffusion processes that share a common starting noise. At each time step within this generation stage, the diffusion model predicts $N$ noises individually. These $N$ noises are then combined by weighted interpolation in the Noise Interpolation Module, concluding the denoising step with a shared interpolated noise for subsequent denoising steps.
Figure 3: Illustration of Step-by-step Generation. (a) We uniformly sample views along this trajectory in sequence to generate a novel-view image; (b) we sample views from nearest to furthest according to view distance to generate a $360^{\circ}$ spin video.
Figure 4: Qualitative results for $360^\circ$ Spin Video Generation. Note the additional consistency in generated views our approach offers over the competing baselines shown in the bounding boxes.
Figure 5: Qualitative comparison for Motion Smoothness. We visualize the output videos using space-time Y-t slices through frames of the generated spin video (along the scanline shown in the condition).
...and 8 more figures
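
The weighted interpolation described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `view_weights` and `interpolate_noise`, the softmax-over-negative-distance weighting, and the `temperature` parameter are all assumptions chosen only to show the idea that nearer condition views contribute more to the shared noise. Noise predictions are represented as flat lists of floats to keep the example dependency-free.

```python
import math
from typing import List, Sequence


def view_weights(distances: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Return one weight per condition view, larger for nearer views.

    Assumption: a softmax over negative view distance. The paper's actual
    weighting scheme is not specified in this summary.
    """
    if not distances:
        raise ValueError("at least one condition view is required")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scores = [-d / temperature for d in distances]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_noise(
    noise_preds: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine N per-view noise predictions into one shared noise estimate.

    Each prediction is a flattened noise tensor; the result is their
    element-wise weighted sum.
    """
    if len(noise_preds) != len(weights):
        raise ValueError("need exactly one weight per noise prediction")
    if not noise_preds:
        raise ValueError("at least one noise prediction is required")
    length = len(noise_preds[0])
    if any(len(p) != length for p in noise_preds):
        raise ValueError("all noise predictions must have the same length")
    return [
        sum(w * p[i] for w, p in zip(weights, noise_preds))
        for i in range(length)
    ]


if __name__ == "__main__":
    # Three hypothetical condition views at 10, 20, and 40 degrees from the target.
    weights = view_weights([10.0, 20.0, 40.0], temperature=10.0)
    preds = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]]
    print(interpolate_noise(preds, weights))
```

In a full pipeline, the interpolated noise would replace the single-view prediction inside each reverse-diffusion step, so that all $N$ processes continue from the same denoised latent. The choice of `temperature` here only controls how sharply the hypothetical weighting favors near views.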
