Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos
Kaihua Chen, Tarasha Khurana, Deva Ramanan
TL;DR
This work presents CogNVS, a diffusion-based video inpainting approach for dynamic novel-view synthesis from monocular videos. It decomposes the task into three steps: 3D reconstruction of the dynamic scene, rendering from a target pose, and inpainting of the regions hidden in the input view. CogNVS is trained with self-supervision on 2D videos and finetuned at test time to adapt to each new video. By rendering co-visible pixels from the reconstruction and inpainting only the hidden ones, CogNVS preserves scene geometry while remaining photorealistic, and it outperforms almost all prior methods for novel-view synthesis of dynamic scenes. The combination of large-scale pretraining and test-time finetuning lets the model be applied zero-shot to in-the-wild videos.
Abstract
We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches either rely on costly test-time optimization of 4D representations or, when trained in a feed-forward manner, do not preserve scene geometry. Our approach is based on three key insights. (1) Covisible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel views. (2) Hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. (3) This in turn allows CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
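To make the split between covisible and hidden pixels concrete, the following is a minimal sketch for a single frame, not the paper's implementation. It assumes a pinhole camera, a per-pixel depth map standing in for the 3D reconstruction, and nearest-pixel point splatting as the renderer; the function name `render_covisible` and all of its parameters are hypothetical. Target pixels that receive no source point form the hidden region that an inpainting model would have to fill.

```python
import numpy as np

def render_covisible(rgb, depth, K, T_src_to_tgt):
    """Splat source pixels into a target view (illustrative sketch only).

    rgb:          (H, W, 3) source image
    depth:        (H, W) positive depth per source pixel (assumed given)
    K:            (3, 3) pinhole intrinsics, shared by both views (assumption)
    T_src_to_tgt: (4, 4) rigid transform from source to target camera coordinates
    Returns the partial target render (H, W, 3) and a boolean covisibility mask (H, W).
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every source pixel to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Move the points into the target camera frame; keep those in front of it.
    pts_t = (T_src_to_tgt @ pts_h)[:3]
    front = pts_t[2] > 1e-6
    proj = K @ pts_t[:, front]
    ut = np.round(proj[0] / proj[2]).astype(int)
    vt = np.round(proj[1] / proj[2]).astype(int)
    zt = pts_t[2, front]
    colors = rgb.reshape(-1, 3)[front]

    # Discard points that project outside the target image.
    inb = (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)
    ut, vt, zt, colors = ut[inb], vt[inb], zt[inb], colors[inb]

    # Z-buffer: sort near to far, then keep the first (nearest) point per pixel.
    flat = vt * W + ut
    order = np.argsort(zt, kind="stable")
    uniq, first = np.unique(flat[order], return_index=True)
    winners = order[first]

    rendered = np.zeros((H * W, 3), dtype=rgb.dtype)
    mask = np.zeros(H * W, dtype=bool)
    rendered[uniq] = colors[winners]
    mask[uniq] = True
    return rendered.reshape(H, W, 3), mask.reshape(H, W)
```

Under these assumptions, `rendered, covisible = render_covisible(rgb, depth, K, T)` gives the partial target view, and `~covisible` marks the hidden pixels. In a full pipeline, the partial render and the mask would be produced per frame and passed together to the video inpainting model.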
