GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis

Thomas Tanay, Mohammed Brahimi, Michal Nazarczuk, Qingwen Zhang, Sibi Catley-Chandar, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero

Abstract

Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos, and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion and achieve fine-grained, six-degree-of-freedom camera control. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher-resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches, in reconstructing fine-grained geometric details across both static and dynamic regions.
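
At a high level, the method maps an input video to a target video one frame at a time, carrying a recurrent hidden state across target views (the full pipeline is detailed in Figure 2 below). The sketch that follows is only an illustration of that loop under assumed PyTorch conventions; the model methods it calls (`select_views`, `build_plane_sweep`, `patchify`, `reproject`, `latent_render`, `decode`) are hypothetical placeholders, not the released implementation.

```python
# High-level sketch of the recurrent synthesis loop (illustrative only;
# all model methods below are hypothetical placeholders).
import torch

def synthesize_video(model, input_frames, input_cams, target_cams, num_views=4):
    """Render each target view J_i in turn, carrying a recurrent hidden
    state Z_i across target frames."""
    hidden = None                                    # Z_{i-1}
    outputs = []
    for i, Q_i in enumerate(target_cams):
        # 1) Select V input views sampled around the target time t_i.
        I_i, P_i = model.select_views(input_frames, input_cams, i, num_views)
        # 2) Project the inputs into a dynamic plane sweep volume X_i
        #    using the homographies H_{P_i -> Q_i}.
        X_i = model.build_plane_sweep(I_i, P_i, Q_i)
        # 3) Patchify and reshape X_i into a downsampled tensor Y_i.
        Y_i = model.patchify(X_i)
        # 4) Latent rendering into Z_i, reusing Z_{i-1} reprojected from the
        #    previous target camera Q_{i-1} to Q_i.
        if hidden is not None:
            hidden = model.reproject(hidden, target_cams[i - 1], Q_i)
        hidden = model.latent_render(Y_i, hidden)
        # 5) Decode the hidden state into the predicted frame.
        outputs.append(model.decode(hidden))
    return torch.stack(outputs)
```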

Paper Structure

This paper contains 19 sections, 2 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Scene-specific approaches optimized under motion priors typically reconstruct static regions well but struggle with dynamic elements (e.g. 4DGS Wu_2024_CVPR). Diffusion-based approaches conditioned on projected point clouds typically struggle with fine-grained geometries (e.g. Gen3C ren2025gen3c). Our proposed method reconstructs static and dynamic elements with high accuracy.
  • Figure 2: GRVS. For a target view $\mathbf{J}_{i}$ at time $t_i$ and with camera parameters $\mathbf{Q}_{i}$, our Generalizable Recurrent View Synthesizer consists of 5 stages. 1) The selection of $V$ input views $\mathbf{I}_{i}$ uniformly sampled around the time $t_i$, with corresponding camera parameters $\mathbf{P}_{i}$. 2) The projection of $\mathbf{I}_{i}$ into a dynamic plane sweep volume $\mathbf{X}_{i}$ using the homographies $\mathcal{H}_{\mathbf{P}_{i} \to \mathbf{Q}_{i}}$. 3) The patchification and reshaping of $\mathbf{X}_{i}$ into a downsampled tensor $\mathbf{Y}_{i}$. 4) The latent rendering of $\mathbf{Y}_{i}$ into a hidden state $\mathbf{Z}_{i}$ using the recurrent hidden state $\mathbf{Z}_{i-1}$ projected using the homographies $\mathcal{H}_{\mathbf{Q}_{i-1} \to \mathbf{Q}_{i}}$. 5) The decoding of $\mathbf{Z}_{i}$ into the predicted output $\mathbf{\tilde{J}}_{i}$.
  • Figure 3: Dynamic Plane Sweep Volume. Projecting $V$ input images onto $D$ fronto-parallel planes facing the target camera produces a 5D plane sweep volume tensor of shape $D\!\times\!V\!\times\!3\!\times\!H\!\times\!W$, illustrated here on an example scene by averaging over the $V$ dimension and plotting the $D=6$ planes in a row (the depth increases left-to-right). When the scene is static (top), all the elements of the scene appear in focus at their respective depths. When the scene is dynamic (bottom), static elements are in focus and dynamic elements are not. Our model takes as input plane sweep volumes computed over dynamic monocular sequences (see the sketch after this list).
  • Figure 4: Qualitative evaluation on UCSD. On the left, we show the first and last frames of 2 input sequences. We then show, for a mid-sequence frame of each sequence, the predictions of the three baselines and our method. The ground truths are shown on the right, with the dynamic elements highlighted.
  • Figure 5: Qualitative evaluation on Kubric-4D-dyn. At the top, we show every 10th frame of one input sequence, and a visualization of the relative positions of the input trajectory (in green) and three target cameras placed at increasing distances from the input trajectory (in red). Below, we show the predictions of the five baselines and our method for the three target views. The ground truths are shown on the right, with the dynamic elements highlighted (yellow car, white and blue box, green shoes).
  • ...and 1 more figure
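
As a rough illustration of the dynamic plane sweep volume of Figure 3, the sketch below warps $V$ source frames onto $D$ fronto-parallel depth planes of the target camera to produce a $D\!\times\!V\!\times\!3\!\times\!H\!\times\!W$ tensor. It assumes PyTorch, pinhole intrinsics $K$, and world-to-camera extrinsics $[R\,|\,t]$ (so that $X_{cam} = R X_{world} + t$); the function names (`plane_sweep_homography`, `plane_sweep_volume`) are illustrative and not the authors' implementation.

```python
# Minimal sketch: warping V source frames onto D depth planes of the target
# camera (assumed camera conventions; not the authors' implementation).
import torch
import torch.nn.functional as F


def plane_sweep_homography(K_src, R_src, t_src, K_tgt, R_tgt, t_tgt, depth):
    """Homography mapping target pixels to source pixels for the fronto-parallel
    plane z = depth in the target camera frame (plane normal n = [0, 0, 1])."""
    R_rel = R_src @ R_tgt.T                       # target -> source rotation
    t_rel = t_src - R_rel @ t_tgt                 # target -> source translation
    n = torch.tensor([0.0, 0.0, 1.0])
    return K_src @ (R_rel + torch.outer(t_rel, n) / depth) @ torch.inverse(K_tgt)


def plane_sweep_volume(images, src_cams, tgt_cam, depths):
    """images: (V, 3, H, W) source frames; src_cams: list of (K, R, t) tuples;
    tgt_cam: (K, R, t); depths: list of D plane depths.
    Returns a (D, V, 3, H, W) plane sweep volume."""
    V, _, H_px, W_px = images.shape
    ys, xs = torch.meshgrid(torch.arange(H_px, dtype=torch.float32),
                            torch.arange(W_px, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # (3, H*W)

    volume = torch.zeros(len(depths), V, 3, H_px, W_px)
    K_t, R_t, t_t = tgt_cam
    for d, depth in enumerate(depths):
        for v, (K_s, R_s, t_s) in enumerate(src_cams):
            H_mat = plane_sweep_homography(K_s, R_s, t_s, K_t, R_t, t_t, depth)
            src = H_mat @ pix                     # target pixels -> source pixels
            src = src[:2] / src[2:3].clamp(min=1e-6)
            # Normalize pixel coordinates to [-1, 1] for grid_sample.
            grid = torch.stack([src[0] / (W_px - 1) * 2 - 1,
                                src[1] / (H_px - 1) * 2 - 1], dim=-1)
            grid = grid.reshape(1, H_px, W_px, 2)
            volume[d, v] = F.grid_sample(images[v:v + 1], grid,
                                         align_corners=True)[0]
    return volume
```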