Projected Representation Conditioning for High-fidelity Novel View Synthesis
Min-Seop Kwak, Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seungryong Kim
TL;DR
This work tackles high-fidelity novel view synthesis from sparse, unposed images using diffusion models. It introduces ReNoV, a representation-guided diffusion framework that projects external representations into 3D space and reprojects them to a target viewpoint $\pi_{\text{tgt}}$, generating the target-view image $I_{\text{tgt}}$ conditioned on the reprojected representations. A key finding is that external representations such as VGGT and DA3-L encode strong geometric correspondence, enabling both reconstruction of visible regions and coherent inpainting of occluded areas. Empirical results on RealEstate10K and DTU show performance competitive with or superior to state-of-the-art diffusion-based and non-generative methods, demonstrating robust extrapolation and 3D-consistent novel view synthesis.
Abstract
We propose a framework for diffusion-based novel view synthesis that uses external representations as conditions, exploiting their geometric and semantic correspondence properties to improve geometric consistency in generated novel views. First, we analyze in detail the correspondence capabilities that emerge in the spatial attention of external visual representations. Building on these insights, we propose ReNoV (representation-guided novel view synthesis), which injects external representations into the diffusion process through dedicated representation projection modules. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel view synthesis methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
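To make the project-then-reproject idea concrete, the following is a minimal NumPy sketch of warping a per-pixel feature map from a source view into a target view. It is not the paper's projection module: the function name `reproject_features`, the shared pinhole intrinsics `K`, the known source depth, and the known relative pose `T_src_to_tgt` are all assumptions made for illustration. In the unposed setting described above, depth and pose would have to be estimated rather than given.

```python
import numpy as np

def reproject_features(feat, depth, K, T_src_to_tgt):
    """Warp a source-view feature map into a target view (illustrative sketch).

    feat:          (H, W, C) per-pixel features from the source view
    depth:         (H, W) source-view depth along the camera z-axis (assumed known)
    K:             (3, 3) pinhole intrinsics, assumed shared by both views
    T_src_to_tgt:  (4, 4) rigid transform from source to target camera frame
    Returns (warped, mask): (H, W, C) features and an (H, W) bool validity mask.
    """
    H, W, C = feat.shape

    # Homogeneous pixel coordinates [u, v, 1], flattened to (H*W, 3).
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to 3D in the source frame: X = depth * K^{-1} [u, v, 1]^T.
    pts_src = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # Move the points into the target camera frame.
    pts_tgt = pts_src @ T_src_to_tgt[:3, :3].T + T_src_to_tgt[:3, 3]

    # Project onto the target image plane and round to the nearest pixel.
    z = pts_tgt[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)
    proj = pts_tgt @ K.T
    u_t = np.rint(proj[:, 0] / safe_z).astype(int)
    v_t = np.rint(proj[:, 1] / safe_z).astype(int)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Z-buffer: for each target pixel keep the point with the smallest depth.
    idx = np.flatnonzero(valid)
    lin = v_t[idx] * W + u_t[idx]
    order = np.lexsort((z[idx], lin))  # sort by target pixel, then by depth
    uniq, first = np.unique(lin[order], return_index=True)
    winners = idx[order[first]]

    warped = np.zeros((H * W, C), dtype=feat.dtype)
    mask = np.zeros(H * W, dtype=bool)
    warped[uniq] = feat.reshape(-1, C)[winners]
    mask[uniq] = True
    return warped.reshape(H, W, C), mask.reshape(H, W)
```

Target pixels where `mask` is false received no source feature. In this sketch they correspond to the occluded or out-of-view regions that a generative model would need to inpaint, while the masked-in pixels carry features that can guide reconstruction of visible content.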
