Light Field Diffusion for Single-View Novel View Synthesis
Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Hao Tang, Xiaohui Xie
TL;DR
This work introduces Light Field Diffusion (LFD), a diffusion-based framework for single-view novel view synthesis that replaces direct camera pose inputs with pixel-wise light field encodings to impose local 3D constraints. The approach is implemented in both an image-space variant (Image LFD) and a latent-space variant (Latent LFD), and it achieves superior view consistency and high-fidelity results, including strong zero-shot generalization to out-of-distribution data such as RTMV. The latent variant, fine-tuned on Objaverse, demonstrates state-of-the-art performance on several metrics and strong cross-dataset consistency, while the image variant, trained on ShapeNet Car, is competitive with NeRF-based and diffusion baselines. Overall, LFD offers a scalable, geometry-aware diffusion paradigm that uses light field representations to improve multi-view coherence in single-view NVS, with practical implications for 3D-consistent image synthesis from limited input data.
Abstract
Single-view novel view synthesis (NVS), the task of generating images from new viewpoints given a single reference image, is an important but challenging problem in computer vision. Recent NVS methods have leveraged Denoising Diffusion Probabilistic Models (DDPMs) for their ability to produce high-fidelity images. However, current diffusion-based methods typically use camera pose matrices to enforce 3D constraints globally and implicitly, which can lead to inconsistencies among images generated from different viewpoints, particularly in regions with complex textures and structures. To address these limitations, we present Light Field Diffusion (LFD), a novel conditional diffusion-based approach that does not rely directly on camera pose matrices. Instead, LFD transforms the camera pose matrices into a light field encoding, with the same shape as the reference image, that describes the direction of each ray. By integrating the light field encoding with the reference image, our method imposes local pixel-wise constraints within the diffusion process, improving view consistency. We train Image LFD on the ShapeNet Car dataset and fine-tune a pre-trained latent diffusion model on the Objaverse dataset to obtain Latent LFD. The latter exhibits strong zero-shot generalization to out-of-distribution datasets such as RTMV as well as to in-the-wild images. Experiments demonstrate that LFD not only produces high-fidelity images but also achieves superior 3D consistency in complex regions, outperforming existing novel view synthesis methods.
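To make the idea of a pixel-wise ray encoding concrete, the following is a minimal sketch of how a camera pose can be turned into a per-pixel map with the same spatial shape as the image. The abstract does not give the exact formulation, so several choices here are assumptions rather than the authors' implementation: a pinhole camera with intrinsics `K`, a 4x4 camera-to-world pose, pixel centers at half-integer coordinates, and a 6-channel layout (ray origin followed by unit ray direction). The function name `light_field_encoding` is hypothetical.

```python
import numpy as np


def light_field_encoding(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of per-pixel ray origins and unit directions.

    Assumptions (not taken from the paper): pinhole intrinsics K (3x3),
    a 4x4 camera-to-world pose, and origin+direction as the channel layout.
    """
    # Pixel centers: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project homogeneous pixels to camera-frame ray directions.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into the world frame and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray of a pinhole camera starts at the camera center.
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)
```

Under these assumptions, such a map could be concatenated channel-wise with the reference image (or a resized version of it with a latent), so that each pixel carries its own ray rather than sharing one global pose vector.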
