Light Field Diffusion for Single-View Novel View Synthesis

Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Hao Tang, Xiaohui Xie

TL;DR

This work introduces Light Field Diffusion (LFD), a diffusion-based framework for single-view novel view synthesis that replaces direct camera pose inputs with pixel-wise light field encodings to impose local 3D constraints. By implementing both image-space (Image LFD) and latent-space (Latent LFD) variants, the approach achieves superior view consistency and high-fidelity results, including strong zero-shot generalization to out-of-distribution data such as RTMV. The latent variant, fine-tuned on Objaverse, demonstrates state-of-the-art performance on several metrics and strong cross-dataset consistency, while the image variant, trained on ShapeNet Car, validates the method’s competitiveness against NeRF-based and diffusion baselines. Overall, LFD offers a scalable, geometry-aware diffusion paradigm that leverages light field representations to improve multi-view coherence in single-view NVS, with practical implications for 3D-consistent image synthesis from limited input data.

Abstract

Single-view novel view synthesis (NVS), the task of generating images from new viewpoints based on a single reference image, is important but challenging in computer vision. Recent advancements in NVS have leveraged Denoising Diffusion Probabilistic Models (DDPMs) for their exceptional ability to produce high-fidelity images. However, current diffusion-based methods typically utilize camera pose matrices to globally and implicitly enforce 3D constraints, which can lead to inconsistencies in images generated from varying viewpoints, particularly in regions with complex textures and structures. To address these limitations, we present Light Field Diffusion (LFD), a novel conditional diffusion-based approach that transcends the conventional reliance on camera pose matrices. Starting from the camera pose matrices, LFD transforms them into light field encoding, with the same shape as the reference image, to describe the direction of each ray. By integrating light field encoding with the reference image, our method imposes local pixel-wise constraints within the diffusion process, fostering enhanced view consistency. Our approach not only involves training image LFD on the ShapeNet Car dataset but also includes fine-tuning a pre-trained latent diffusion model on the Objaverse dataset. This enables our latent LFD model to exhibit remarkable zero-shot generalization capabilities across out-of-distribution datasets like RTMV as well as in-the-wild images. Experiments demonstrate that LFD not only produces high-fidelity images but also achieves superior 3D consistency in complex regions, outperforming existing novel view synthesis methods.
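
The step from a camera pose to a per-pixel encoding can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with the principal point at the image centre, a camera-to-world rotation and camera centre as the pose, and a plain 6-channel encoding (ray origin plus unit direction) per pixel. The function name `light_field_encoding` and the `focal` parameter are hypothetical, and the exact ray parameterization used by LFD is not stated in this section.

```python
import math


def light_field_encoding(rotation, translation, focal, height, width):
    """Return an H x W grid of 6-vectors (ray origin, unit ray direction).

    Assumptions (illustrative only, not taken from the paper):
      rotation    -- 3x3 camera-to-world rotation as nested lists
      translation -- camera centre in world coordinates (length 3)
      focal       -- focal length in pixels, principal point at image centre
    """
    cx, cy = width / 2.0, height / 2.0
    encoding = []
    for v in range(height):
        row = []
        for u in range(width):
            # Ray through the pixel centre in camera coordinates (+z is forward).
            d_cam = ((u + 0.5 - cx) / focal, (v + 0.5 - cy) / focal, 1.0)
            # Rotate the direction into world coordinates: d_world = R @ d_cam.
            d_world = [
                sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)
            ]
            norm = math.sqrt(sum(c * c for c in d_world))
            d_unit = [c / norm for c in d_world]
            # Every pixel shares the camera centre as its ray origin.
            row.append(list(translation) + d_unit)
        encoding.append(row)
    return encoding


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
encoding = light_field_encoding(identity, [0.0, 0.0, -2.0], focal=2.0, height=3, width=3)
```

The result has the same height and width as the image, so it can be stacked with the image channel-wise. This is what lets the conditioning act on each pixel locally instead of through one global pose vector.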

Paper Structure

This paper contains 38 sections, 5 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: Zero-shot single-view novel view synthesis by Latent Light Field Diffusion. Given a single input view, our method can generate novel views from various viewpoints while maintaining consistency with the reference image.
  • Figure 2: Comparison of our Light Field Diffusion and previous diffusion-based models for single-view novel view synthesis: a) Previous models [watson2022novel, liu2023zero1to3] directly take camera pose matrices (rotation R and translation T) as input, which can only provide 3D constraints globally and implicitly. b) Our Light Field Diffusion transforms the camera pose matrices into light field encodings and concatenates them with the noise and source image, which provides local and explicit pixel-wise 3D geometry constraints, enabling better novel view consistency.
  • Figure 3: The overall pipeline of Light Field Diffusion (LFD) in both latent space and image space. LFD translates the input source camera pose $\mathbf{P^s}$ and target camera pose $\mathbf{P^t}$ into the source light field $\mathbf{L^s}$ and target light field $\mathbf{L^t}$. The U-Net denoiser takes the concatenation of the noised image and $\mathbf{L^t}$ as input. The half U-Net extracts features from the concatenation of the source image and the source light field $\mathbf{L^s}$ and interacts with the U-Net denoiser via cross-attention.
  • Figure 4: Comparison of latent LFD and Zero-1-to-3 [liu2023zero1to3] on the Objaverse dataset. For each object, the first image is the input view. We randomly synthesize two novel views. More results and video visualizations can be found in the supplementary material.
  • Figure 5: Comparison of latent LFD and Zero-1-to-3 [liu2023zero1to3] on the RTMV dataset.
  • ...and 9 more figures
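
The two concatenations described in the Figure 3 caption can be sketched with plain nested lists. This is only an illustration of the channel bookkeeping: the helper `concat_channels`, the 3-channel images, the 6-channel light fields, and the dummy values are all assumptions, and the cross-attention between the two networks is not modelled.

```python
def concat_channels(image, light_field):
    """Concatenate two H x W x C grids along the channel axis, pixel by pixel."""
    if len(image) != len(light_field) or len(image[0]) != len(light_field[0]):
        raise ValueError("image and light field must share height and width")
    return [
        [list(px) + list(ray) for px, ray in zip(img_row, lf_row)]
        for img_row, lf_row in zip(image, light_field)
    ]


height, width = 2, 2
# Dummy 3-channel images and 6-channel light fields (channel counts are assumed).
noised_target = [[[0.0] * 3 for _ in range(width)] for _ in range(height)]
source_image = [[[1.0] * 3 for _ in range(width)] for _ in range(height)]
target_light_field = [[[0.5] * 6 for _ in range(width)] for _ in range(height)]
source_light_field = [[[0.25] * 6 for _ in range(width)] for _ in range(height)]

# Input to the U-Net denoiser: noised image concatenated with L^t.
denoiser_input = concat_channels(noised_target, target_light_field)
# Input to the half U-Net: source image concatenated with L^s.
encoder_input = concat_channels(source_image, source_light_field)
```

Because each light field has the same spatial shape as its image, every pixel in both inputs carries its own ray information alongside its colour or noise values.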