DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models

Shengqu Cai; Eric Ryan Chan; Songyou Peng; Mohamad Shahbazi; Anton Obukhov; Luc Van Gool; Gordon Wetzstein

TL;DR

DiffDreamer tackles long-range, single-view scene extrapolation with a conditional diffusion model trained on internet-collected images of nature scenes. It uses a render-refine-repeat pipeline and two inference techniques, anchored conditioning and virtual lookahead conditioning, to enforce spatio-temporal consistency across many steps. The approach achieves better 3D coherence than prior per-frame or GAN-based methods while maintaining competitive 2D image fidelity metrics. This unsupervised framework enables realistic fly-through synthesis and has potential for downstream 3D representations and content creation, with room for speedups and diversification in future work.

Abstract

Scene extrapolation, the idea of generating novel views by flying into a given image, is a promising yet challenging task. For each predicted frame, a joint inpainting and 3D refinement problem has to be solved, which is ill-posed and highly ambiguous. Moreover, training data for long-range scenes is difficult to obtain and usually lacks sufficient views to infer accurate camera poses. We introduce DiffDreamer, an unsupervised framework capable of synthesizing novel views depicting a long camera trajectory while training solely on internet-collected images of nature scenes. Utilizing the stochastic nature of the guided denoising steps, we train the diffusion models to refine projected RGBD images but condition the denoising steps on multiple past and future frames for inference. We demonstrate that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods. DiffDreamer is a powerful and efficient solution for scene extrapolation, producing impressive results despite limited supervision. Project page: https://primecai.github.io/diffdreamer.
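
The idea of conditioning the denoising steps on several frames can be illustrated with a small toy sketch. It assumes that one conditioning image is drawn at random for each denoising step; the abstract does not spell out the sampling rule, so the uniform draw, the function names, and the scalar "images" below are all illustrative assumptions, and `toy_denoise_step` is a stand-in for a real diffusion model rather than the authors' network.

```python
import random

def pick_conditioning_per_step(names, num_steps, seed=0):
    """Draw one conditioning name per denoising step (uniform draw is an assumption)."""
    rng = random.Random(seed)
    return [rng.choice(names) for _ in range(num_steps)]

def toy_denoise_step(x, cond, blend=0.1):
    """Stand-in for one guided denoising step: nudge each pixel toward the conditioning."""
    return [(1.0 - blend) * xi + blend * ci for xi, ci in zip(x, cond)]

def refine_with_stochastic_conditioning(x_init, conditionings, num_steps=50, seed=0):
    """Run the toy denoiser, switching the conditioning image at every step."""
    names = sorted(conditionings)
    schedule = pick_conditioning_per_step(names, num_steps, seed)
    x = list(x_init)
    for name in schedule:
        x = toy_denoise_step(x, conditionings[name])
    return x, schedule

# Hypothetical two-pixel "images" standing in for warped past and future frames.
conditionings = {
    "past_near": [1.0, 1.0],
    "past_far": [2.0, 2.0],
    "future": [3.0, 3.0],
}
refined, schedule = refine_with_stochastic_conditioning([0.0, 0.0], conditionings)
print(schedule[:5], refined)
```

Because every step sees only one conditioning image, the final sample is pulled toward all of them over the course of the trajectory, which is the intuition behind using multiple past and future frames at inference.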

Paper Structure (32 sections, 2 equations, 14 figures, 6 tables)

Introduction
Related works
Novel view synthesis from multi-view images
Novel view synthesis from a single image
Scene extrapolation
Long-term path synthesis
Diffusion models
DiffDreamer
Training
Inference
Anchored conditioning
Virtual lookahead conditioning
Training details
Experiments
Evaluation
...and 17 more sections

Figures (14)

Figure 1: DiffDreamer (Top) is a novel diffusion-based approach for scene extrapolation. It exhibits high spatio-temporal consistency, a desired property missing in prior art such as InfNat-0 [li2022infnat_zero] (Bottom). We check for consistency by extracting keypoints from the sequences with COLMAP, resulting in point clouds of vastly different sizes and sparsity (Right).
Figure 2: Overview of our pipeline. We train an image-conditional diffusion model to perform image-to-image refinement and inpainting given a corrupted image and its missing region mask. At inference, we perform stochastic conditioning on three inputs: naive forward warping from the previous frame (black arrow), anchored conditioning by warping a further frame (blue arrow), and lookahead conditioning by warping a virtual future frame (red arrow). We repeat this render-refine-repeat pipeline to get sequences extrapolating a given image.
Figure 3: Qualitative comparisons of InfNat-0 [li2022infnat_zero] and our DiffDreamer generation, for which we ask the models to fly toward a target region and compare the outputs. Because InfNat-0 [li2022infnat_zero] is not 3D consistent and may need more steps even with identical input disparities and camera speed, we manually inserted more refinement steps into our DiffDreamer to ensure a fair comparison. Even so, we do not observe significant drifting from our DiffDreamer, while InfNat-0 [li2022infnat_zero] is incapable of preserving the input domain.
Figure 4: Long-range view extrapolation of over 50 steps forward.
Figure 5: Comparison of 3D consistency achieved by our DiffDreamer and InfNat-0 [li2022infnat_zero], where we ask the camera to fly towards the top of the hill and show the intermediate renderings at camera positions $c_0$ to $c_5$.
...and 9 more figures
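
The render-refine-repeat loop described in the Figure 2 caption can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `warp`, `refine`, and `make_virtual_future` are hypothetical callables supplied by the caller, `anchor_gap` is an invented parameter for how far back the anchor frame lies, and the toy stand-ins at the bottom use scalars in place of RGBD images.

```python
def render_refine_repeat(first_frame, num_steps, warp, refine,
                         make_virtual_future, anchor_gap=5):
    """Extrapolate a sequence by repeatedly warping and refining frames.

    warp(frame, src_idx, dst_idx)      -> frame rendered at the target pose
    refine(conditionings, dst_idx)     -> refined frame at the target pose
    make_virtual_future(frames, dst)   -> (virtual_frame, its_idx); hypothetical
    """
    frames = [first_frame]
    for dst in range(1, num_steps + 1):
        prev_idx = dst - 1
        anchor_idx = max(0, dst - anchor_gap)  # a further past frame
        future, future_idx = make_virtual_future(frames, dst)
        conditionings = {
            "previous": warp(frames[prev_idx], prev_idx, dst),    # black arrow
            "anchor": warp(frames[anchor_idx], anchor_idx, dst),  # blue arrow
            "lookahead": warp(future, future_idx, dst),           # red arrow
        }
        frames.append(refine(conditionings, dst))
    return frames

# Toy stand-ins so the loop can be run without any model or renderer.
warp_log = []

def toy_warp(frame, src_idx, dst_idx):
    warp_log.append((src_idx, dst_idx))
    return frame  # identity "warp"

def toy_refine(conditionings, dst_idx):
    return sum(conditionings.values()) / len(conditionings)

def toy_virtual_future(frames, dst_idx):
    return frames[-1], dst_idx + 3  # pretend a frame 3 steps ahead exists

frames = render_refine_repeat(1.0, 4, toy_warp, toy_refine,
                              toy_virtual_future, anchor_gap=2)
print(len(frames), warp_log[:3])
```

In a real system the three warped images would feed the conditional diffusion model during refinement; here they are simply averaged so that the control flow is easy to follow.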
