PS4PRO: Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization

Yezhi Shen, Qiuchen Zhai, Fengqing Zhu

TL;DR

The paper tackles the limited viewpoint coverage in neural rendering by introducing PS4PRO, a lightweight, flow-based video frame interpolation model trained on diverse video data to implicitly encode camera motion and 3D geometry. It introduces pixel-to-pixel supervision to enforce cross-frame consistency, and integrates PS4PRO as a data augmentation tool that generates intermediate views ($I_t$) to enrich neural rendering training. The approach improves reconstruction accuracy for both static and dynamic scenes when applied to NeRF/3DGS-based methods, with minimal computational overhead. Experiments across frame interpolation benchmarks and neural rendering systems show that the model generalizes broadly and improves PSNR, SSIM, and LPIPS. Overall, PS4PRO provides a practical, scalable augmentation strategy for neural rendering pipelines facing sparse or unobserved viewpoints.

Abstract

Neural rendering methods have gained significant attention for their ability to reconstruct 3D scenes from 2D images. The core idea is to take multiple views as input and optimize the reconstructed scene by minimizing the uncertainty in geometry and appearance across the views. However, the reconstruction quality is limited by the number of input views. This limitation is further pronounced in complex and dynamic scenes, where certain angles of objects are never seen. In this paper, we propose to use video frame interpolation as the data augmentation method for neural rendering. Furthermore, we design a lightweight yet high-quality video frame interpolation model, PS4PRO (Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization). PS4PRO is trained on diverse video datasets, implicitly modeling camera movement as well as real-world 3D geometry. Our model performs as an implicit world prior, enriching the photo supervision for 3D reconstruction. By leveraging the proposed method, we effectively augment existing datasets for neural rendering methods. Our experimental results indicate that our method improves the reconstruction performance on both static and dynamic scenes.

Paper Structure

This paper contains 13 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the overall framework of the proposed method. Given input frames $I_0$ and $I_1$, the proposed frame interpolation model synthesizes the intermediate view $I_t$ at time $t$. Both the original views and the intermediate views are then used to optimize the neural rendering model. The figure shows a single interpolated frame, but the model can generate multiple intermediate frames at different time steps. See Figure 3 for more details on the Base Block and Refinement Block.
  • Figure 2: Illustration of reduction of the reconstruction uncertainty by introducing intermediate frames as supervision.
  • Figure 3: Architecture of Base and Refinement Blocks in PS4PRO.
  • Figure 4: Visual comparison of KITTI scenes reconstructed using Lightning-NeRF without (left column) and with (right column) data augmentation using PS4PRO. From top to bottom, the image pairs are extracted from sequences 1 to 5, respectively.
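
The augmentation flow described in the Figure 1 caption (take adjacent frames $I_0$ and $I_1$, synthesize $I_t$ at one or more time steps, then train on originals and intermediates together) can be sketched roughly as follows. This is only an illustration under stated assumptions: `augment_views` and `linear_blend` are hypothetical names, the evenly spaced time steps are an assumption, and the cross-fade is a trivial placeholder standing in for a learned interpolator such as PS4PRO, not the paper's model.

```python
import numpy as np


def linear_blend(frame0, frame1, t):
    """Hypothetical stand-in interpolator: a plain cross-fade, NOT PS4PRO."""
    return (1.0 - t) * frame0 + t * frame1


def augment_views(frames, interpolate=linear_blend, num_intermediate=1):
    """Insert synthesized frames between each adjacent pair of input frames.

    Returns a list of (timestamp, image, is_synthetic) tuples. Original
    frame k gets timestamp k; synthesized frames get evenly spaced
    timestamps strictly between k and k + 1 (an assumption of this sketch).
    """
    augmented = []
    for k in range(len(frames) - 1):
        augmented.append((float(k), frames[k], False))
        for j in range(1, num_intermediate + 1):
            t = j / (num_intermediate + 1)  # t in (0, 1)
            synthesized = interpolate(frames[k], frames[k + 1], t)
            augmented.append((k + t, synthesized, True))
    if len(frames) > 0:
        # The last original frame has no successor to interpolate with.
        augmented.append((float(len(frames) - 1), frames[-1], False))
    return augmented


# Toy usage: three constant-valued 2x2 RGB frames.
frames = [np.full((2, 2, 3), v, dtype=np.float32) for v in (0.0, 1.0, 2.0)]
augmented = augment_views(frames, num_intermediate=1)
timestamps = [ts for ts, _, _ in augmented]
```

With one intermediate frame per pair, three input frames become five training views. Keeping the `is_synthetic` flag makes it possible to treat synthesized views differently from captured ones downstream, which is a design choice of this sketch rather than something the source specifies.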