Projected Representation Conditioning for High-fidelity Novel View Synthesis

Min-Seop Kwak, Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seungryong Kim

TL;DR

This work tackles high-fidelity novel view synthesis from sparse, unposed images using diffusion models. It introduces ReNoV, a representation-guided diffusion framework that projects external representations into 3D space and reprojects them to a target viewpoint, producing the target-view image $I_{\text{tgt}}$ conditioned on the target viewpoint $\pi_{\text{tgt}}$. A key finding is that external representations such as VGGT and DA3-L encode strong geometric correspondence, enabling both reconstruction of visible regions and coherent inpainting of occluded areas. Empirical results on RealEstate10K and DTU show performance competitive with or superior to state-of-the-art diffusion-based and non-generative methods, demonstrating robust extrapolation and 3D-consistent novel view synthesis.
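The project-then-reproject step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `warp_features`, the pinhole camera model, nearest-pixel splatting, and the z-buffer rule are all assumptions made for illustration. It only shows how per-point features might be carried into a target view, and why some target pixels are left empty and need inpainting.

```python
import numpy as np

def warp_features(points_world, feats, K, w2c, height, width):
    """Splat per-point features into a target view (illustrative sketch).

    points_world: (P, 3) 3D points in world coordinates.
    feats:        (P, C) one feature vector per point.
    K:            (3, 3) pinhole intrinsics of the target camera (assumed).
    w2c:          (4, 4) world-to-camera extrinsics of the target view.
    Returns a (H, W, C) feature plane and a (H, W) boolean validity mask.
    """
    # Transform points into the target camera frame.
    ones = np.ones((len(points_world), 1))
    homo = np.concatenate([points_world, ones], axis=1)
    cam = (w2c @ homo.T).T[:, :3]
    z = cam[:, 2]

    # Drop points behind the camera before the perspective divide.
    front = z > 1e-6
    cam, z, feats = cam[front], z[front], feats[front]

    # Perspective projection to integer pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)

    # Keep only points that land inside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, feats = u[inside], v[inside], z[inside], feats[inside]

    plane = np.zeros((height, width, feats.shape[1]), dtype=feats.dtype)
    mask = np.zeros((height, width), dtype=bool)

    # Z-buffer: write far points first so nearer points overwrite them
    # (NumPy keeps the last value when an index is repeated).
    order = np.argsort(-z)
    plane[v[order], u[order]] = feats[order]
    mask[v[order], u[order]] = True
    return plane, mask
```

Pixels where `mask` is `False` received no projected point. In this sketch they correspond to the holes that a generative model would have to fill, while pixels where `mask` is `True` carry reference features that can guide reconstruction.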

Abstract

We propose a novel framework for diffusion-based novel view synthesis that uses external representations as conditions, harnessing their geometric and semantic correspondence properties to improve the geometric consistency of generated novel views. First, we provide a detailed analysis of the correspondence capabilities that emerge in the spatial attention of external visual representations. Building on these insights, we propose ReNoV (representation-guided novel view synthesis), which injects external representations into the diffusion process through dedicated representation projection modules. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.

Paper Structure (37 sections, 4 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Cross-view attention maps of the denoising network [seo2024genwarp, kwak2025aligned]. A query pixel (blue dot) is chosen in the warped target view, and the resulting cross-attention weights on two reference images are visualized. Inpainting: the wheel is absent in the warped view, so attention shifts to the corresponding wheels in the references. Reconstruction: the suitcase edge is visible, so attention concentrates on the geometrically aligned edges to refine the reconstruction.
  • Figure 2: Analysis of visual foundation models. (a) Geometric correspondence, (b) semantic correspondence, and (c) local vs. distant similarity across feature layers in VGGT [wang2025vggt], DA3-Large [lin2025depth], and DINOv2-Large [oquab2023dinov2].
  • Figure 3: Geometric correspondence. A query point (blue dot) is selected in Frame 1, and cosine similarity maps are computed in Frame 2 and Frame 3. The scene contains featureless walls, allowing assessment of whether the model can localize the geometrically corresponding point. Deeper layers of VGGT and DA3-L accurately identify the correct location in the wall corner aligned with the query point, while layer 0 of VGGT and the DINOv2 features attend to incorrect but semantically similar locations on the wall. This illustrates that deeper layers of VGGT and DA3-L capture geometric structure more reliably than early VGGT layers or DINOv2.
  • Figure 4: Qualitative results for feature reconstruction analysis. We warp the extracted features using point clouds, resulting in feature-level holes that require inpainting.
  • Figure 5: Model architecture. Given $N$ reference images, we extract visual features, dense point clouds, and camera poses using an external representation model (e.g., VGGT [wang2025vggt], DA3 [lin2025depth], or DINOv2 [oquab2023dinov2]). These components undergo projected representation conditioning, where reference features and point clouds are projected into the target camera frustum to form warped representation planes and point-map planes. The reference network aggregates these multi-view inputs and passes them as keys and values to the denoising network. Simultaneously, the denoising network receives the projected feature and point-cloud planes as direct conditioning, aggregating reference cues to synthesize the novel view image.
  • ...and 10 more figures
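
The cosine similarity maps referred to in the Figure 3 caption can be sketched as follows. This is a minimal illustration rather than the paper's analysis code: the function names `cosine_similarity_map` and `best_match` are hypothetical, and the feature maps are assumed to be dense `(H, W, C)` arrays already extracted from a representation model.

```python
import numpy as np

def cosine_similarity_map(query_feats, target_feats, query_yx):
    """Cosine similarity between one query feature and every target location.

    query_feats:  (H, W, C) feature map of the frame containing the query.
    target_feats: (H2, W2, C) feature map of another frame.
    query_yx:     (row, col) of the query point in query_feats.
    Returns an (H2, W2) map with values in [-1, 1].
    """
    # Unit-normalize the query feature (epsilon guards against zero vectors).
    q = query_feats[query_yx[0], query_yx[1]]
    q = q / (np.linalg.norm(q) + 1e-8)

    # Unit-normalize every target feature along the channel axis.
    norms = np.linalg.norm(target_feats, axis=-1, keepdims=True)
    t = target_feats / (norms + 1e-8)

    # The dot product of unit vectors is the cosine similarity.
    return t @ q

def best_match(sim_map):
    """Return the (row, col) of the highest-similarity location."""
    return np.unravel_index(np.argmax(sim_map), sim_map.shape)
```

Comparing `best_match` against the known geometric correspondence of the query point is one simple way to check whether a given feature layer localizes the geometrically correct location or only a semantically similar one.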