Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos

Kaihua Chen, Tarasha Khurana, Deva Ramanan

TL;DR

This work presents CogNVS, a diffusion-based video inpainting approach for dynamic novel-view synthesis from monocular videos. It decomposes the task into three steps: reconstructing the dynamic 3D scene, rendering it from a target pose, and inpainting the regions that the input video never observed. CogNVS is trained with self-supervision on 2D videos and finetuned at test time to adapt to each new video. By rendering co-visible pixels from the reconstruction and filling the rest with 3D-aware inpainting, CogNVS produces 3D-consistent, photorealistic novel views and outperforms almost all prior methods on synthetic and real-world datasets. The combination of large-scale pretraining and targeted test-time finetuning offers a practical route to zero-shot dynamic view synthesis in unconstrained settings.

Abstract

We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) co-visible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel views, and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.

Paper Structure

This paper contains 41 sections, 6 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: We present CogNVS, a video diffusion model that enables novel-view synthesis of dynamic scenes. Given an in-the-wild monocular video of a dynamic scene, we first reconstruct the scene, render it from the target novel view, and inpaint any unobserved regions. Because CogNVS can be pre-trained via self-supervision, it can also be test-time-finetuned on a given target video, enabling it to generalize zero-shot to novel domains. Our simple pipeline outperforms almost all prior state-of-the-art methods for dynamic novel-view synthesis. We show outputs from CogNVS on two unseen videos: a real-world video above, and a generated video below.
  • Figure 2: CogNVS overview. During training (left), given a 2D source video (in blue) of a dynamic scene, we first reconstruct the scene using off-the-shelf monocular reconstruction algorithms like MegaSAM [li2024megasam] to obtain the 3D scene geometry, $\mathcal{G}_{\rm src}$, and camera odometry, $\mathbf{c}_{\rm src}$. We then sample a set of arbitrary camera trajectories $\{\mathbf{c}_1, \cdots, \mathbf{c}_N\}$ to simulate plausible occluded geometries, $\{\mathcal{G}^{\rm cov}_{{\rm src},1}, \cdots, \mathcal{G}^{\rm cov}_{{\rm src},N}\}$, which, when rendered from the original camera trajectory, $\mathbf{c}_{\rm src}$, produce a mask of source pixels that are co-visible in the sampled trajectory (in orange). The source video and its masked variant produce a self-supervised training pair for learning CogNVS, our video inpainting diffusion model (visualized in Fig. 3). At inference (right), we finetune CogNVS on the given input sequence by similarly constructing self-supervised training pairs. The final novel view is then generated using the finetuned CogNVS in a feed-forward manner.
  • Figure 3: Self-supervised training data generation. To curate a large training set for video inpainting, we first reconstruct an input source 2D video (in blue) with an off-the-shelf monocular SLAM system. After reconstruction, we randomly sample $N$ pairs of 'start' and 'end' camera poses from a spherical region, $\mathcal{S}$, around the estimated camera pose in the given 2D video. $\mathcal{S}$ is bounded by a predefined deviation in the spherical coordinate axes, similar to prior work [yu2024viewcrafter]. We sample an ${\rm SE(3)}$ camera trajectory that interpolates the start and end poses while looking at the center of the scene. We render the reconstruction from this novel trajectory (in dotted orange), and use the rendering to identify co-visible pixels in the original source view (in orange). The source video and its masked variant are used to produce a self-supervised training pair for training CogNVS, our "3D-aware" video inpainting diffusion model.
  • Figure 4: We show a qualitative comparison with state-of-the-art approaches for dynamic novel-view synthesis on Kubric-4D (top), ParallelDomain-4D (middle) and DyCheck (bottom). Note how reconstruction alone, whether from ground-truth depth, MegaSAM [li2024megasam], Shape of Motion [wang2024shapeom], or MoSca [lei2024mosca], cannot synthesize a complete novel view. Optimization-based approaches like Shape of Motion and MoSca blur the dynamic regions when fitting 4D representations. CAT4D [wu2024cat4d], whose visuals are taken from its project page due to unavailable code, struggles to generalize. TrajectoryCrafter [yu2025trajectorycrafter] over-hallucinates the occluded regions and does not preserve geometry. GCD [gcd] performs well because it was trained on Kubric-4D and ParallelDomain-4D. Our method can instead produce photorealistic and 3D-consistent novel views for the given scenes in a zero-shot manner with test-time finetuning, even starting from point cloud renders that are incomplete and noisy (e.g., from MegaSAM for DyCheck). It is consistently able to synthesize sharp dynamic objects, which the other baselines struggle with. Please see the video in the appendix.
  • Figure 5: We qualitatively analyze the effect of pretraining and test-time finetuning. We note that without the data-driven robustness and generalization of pretraining (second column), CogNVS cannot hallucinate missing regions properly (e.g., inpainted region in first row is still black in top left corner). Finally, without test-time finetuning (third column), 3D consistency and adherence to scene lighting and appearance properties cannot be ensured (e.g., overall darker scene in second row, and output off by a few pixels at the bottom and right side of the image in first row, thereby inhibiting geometric consistency).
  • ...and 10 more figures
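
The self-supervised pair construction described in Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single frame with a per-pixel depth map, a pinhole intrinsics matrix shared by both cameras, and a known source-to-target rigid transform, and it only tests whether each source pixel falls inside the sampled camera's field of view. The paper instead renders the reconstructed geometry, which also accounts for occlusion. The function names `covisibility_mask` and `make_training_pair` and all parameters are hypothetical.

```python
import numpy as np


def covisibility_mask(depth, K, T_src_to_tgt, min_depth=1e-6):
    """Return an (h, w) boolean mask of source pixels that project inside
    the target image. Assumption: no occlusion (z-buffer) test is done."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)

    # Back-project every source pixel to a 3D point in the source camera frame.
    rays = pix @ np.linalg.inv(K).T
    pts_src = rays * depth.reshape(-1, 1)

    # Move the points into the sampled (target) camera frame.
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
    pts_tgt = pts_src @ R.T + t

    # Keep only points in front of the target camera.
    z = pts_tgt[:, 2]
    in_front = z > min_depth

    # Perspective projection; guard the division for points behind the camera.
    proj = pts_tgt @ K.T
    safe_z = np.where(in_front, z, 1.0)
    u_tgt = proj[:, 0] / safe_z
    v_tgt = proj[:, 1] / safe_z

    # Pixel i is taken to cover [i - 0.5, i + 0.5] along each axis.
    inside = (
        (u_tgt >= -0.5) & (u_tgt <= w - 0.5)
        & (v_tgt >= -0.5) & (v_tgt <= h - 0.5)
    )
    return (in_front & inside).reshape(h, w)


def make_training_pair(frame, mask):
    """Zero out non-co-visible pixels; return (masked input, original target)."""
    masked = frame * mask[..., None].astype(frame.dtype)
    return masked, frame


# Toy example: a fronto-parallel plane at depth 1 and a sideways camera shift.
K = np.array([[10.0, 0.0, 3.5], [0.0, 10.0, 2.5], [0.0, 0.0, 1.0]])
depth = np.ones((6, 8))
T = np.eye(4)
T[0, 3] = 0.4  # shifts every projection by 4 pixels along u
mask = covisibility_mask(depth, K, T)
masked, target = make_training_pair(np.ones((6, 8, 3)), mask)
```

In this toy setup only the four left-most columns of the source frame remain co-visible, so the masked frame keeps those columns and zeroes the rest. An inpainting model trained on such pairs would be asked to recover `target` from `masked`.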