Fast View Synthesis of Casual Videos with Soup-of-Planes

Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu

TL;DR

The paper addresses the challenge of novel view synthesis from monocular in-the-wild videos by proposing a hybrid explicit representation that separates static and dynamic content. Static content is modeled with an extended soup-of-planes augmented with spherical harmonics and displacement maps to capture view-dependent effects and non-flat geometries, while dynamic content is represented per-frame as point clouds with temporal blending for consistency. The method enables fast per-video optimization (about 15 minutes) and real-time rendering (around 27 FPS), achieving quality comparable to state-of-the-art NeRF-based methods while dramatically reducing training and rendering time. Evaluations on NVIDIA and DAVIS datasets demonstrate competitive perceptual quality with substantial speedups, highlighting practical applicability for efficient cross-view video synthesis in the wild.

Abstract

Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.

Paper Structure (17 sections, 9 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Efficient dynamic novel view synthesis. Our method takes only 15 minutes to optimize a representation from an in-the-wild video and can render novel views at 27 FPS. On the NVIDIA Dataset [yoon2020nvidia], our method achieves a rendering quality comparable to the state-of-the-art NeRF-based methods but is much faster to train and render. The bubble size in the figure indicates the training time (GPU-hours).
  • Figure 2: 3DGS [3dgaussian] fails on weak-parallax videos. We show two casual videos from DAVIS [davis] whose 3D reconstructions reveal weak parallax because the camera motion consists only of rotation. We use the ground-truth masks to filter out the dynamic content and reconstruct only the static scenes. We initialize both 3DGS and our method with the same 3D point cloud obtained from video depth. 3DGS cannot handle such casual videos and produces floaters and noise in novel views due to insufficient parallax cues.
  • Figure 3: Method overview. We first preprocess an input monocular video to obtain the video depth and pose as well as the dynamic masks (Sec. \ref{sec:preprocessing}). The input video is then decomposed into static and dynamic content. We initialize a soup of oriented planes by fitting them to the static scene. These planes are augmented to capture view-dependent effects and non-planar complex surfaces. They are then warped to the target view and composited from far to near to generate the target static view (Sec. \ref{sec:static_model}). We estimate per-frame point clouds for dynamic content together with dynamic masks (Sec. \ref{sec:dynamic_model}). For temporal consistency, we use optical flow to blend the dynamic content from neighboring frames. The blended dynamic content is then warped to the target view. Finally, the target novel view is composited from the static and dynamic content.
  • Figure 4: View-dependent texture. (a) Since a flat plane cannot sufficiently represent a non-flat surface, different viewing rays look at the same actual point but hit the plane in different locations (red arrow) and query different RGBA values. (b) We augment it with both view-dependent color and displacement. The RGBA should be displaced to different locations depending on different viewing rays. Both of them are encoded by spherical harmonic coefficients, $\mathcal{C}^{0..{\ell_{\max}^{\mathcal{C}}}}$ and $\Delta^{0..{\ell_{\max}^{\Delta}}}$, respectively. Given a view direction $\mathbf{v}$, we first obtain the view-specific color $\mathcal{C}^\mathbf{v}$ and displacement $\Delta^\mathbf{v}$, then displace $\mathcal{C}^\mathbf{v}$ into the final view-specific $\mathcal{C}^{\mathbf{v},\Delta^{\mathbf{v}}}$ texture for planar homography warping to the target view. Note that the transparency map $\alpha$ is shifted jointly with $\mathcal{C}^\mathbf{v}$.
  • Figure 5: Visual comparison on the NVIDIA dataset. Our method achieves comparable rendering quality for both static and dynamic content. Although our rendered dynamics may be slightly misaligned with the ground truth because dynamic depth estimation is ill-posed, our results are sharp and perceptually similar to the ground truth. $\dagger$ We reproduced the per-scene optimization results of [tian2023mononerf] using their official code.
  • ...and 5 more figures
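
The far-to-near compositing step described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the plane textures have already been warped to the target view as RGBA images, uses the standard "over" operator, and treats the dynamic mask as a soft per-pixel weight. The function names `composite_far_to_near` and `blend_dynamic_over_static` are hypothetical and chosen only for illustration.

```python
import numpy as np


def composite_far_to_near(rgba_layers):
    """Composite already-warped RGBA layers with the "over" operator.

    rgba_layers: array of shape (N, H, W, 4), ordered from far to near,
    with non-premultiplied colors and alpha values in [0, 1].
    Returns (rgb, alpha): rgb has shape (H, W, 3) and is premultiplied
    by the accumulated alpha; alpha has shape (H, W, 1).
    """
    layers = np.asarray(rgba_layers, dtype=np.float64)
    height, width = layers.shape[1:3]
    rgb = np.zeros((height, width, 3))
    alpha = np.zeros((height, width, 1))
    for layer in layers:  # farthest layer first, nearest layer last
        color, a = layer[..., :3], layer[..., 3:4]
        # The nearer layer covers what has been accumulated so far.
        rgb = color * a + rgb * (1.0 - a)
        alpha = a + alpha * (1.0 - a)
    return rgb, alpha


def blend_dynamic_over_static(static_rgb, dynamic_rgb, dynamic_mask):
    """Assumed final step: place dynamic content over the static view.

    dynamic_mask has shape (H, W) with values in [0, 1].
    """
    mask = np.asarray(dynamic_mask, dtype=np.float64)[..., None]
    return mask * dynamic_rgb + (1.0 - mask) * static_rgb


# Toy example: an opaque red far plane behind a half-transparent green one.
far = np.array([[[1.0, 0.0, 0.0, 1.0]]])
near = np.array([[[0.0, 1.0, 0.0, 0.5]]])
static_rgb, static_alpha = composite_far_to_near(np.stack([far, near]))

# A blue dynamic pixel fully covers the static view where the mask is 1.
dynamic_rgb = np.array([[[0.0, 0.0, 1.0]]])
final = blend_dynamic_over_static(static_rgb, dynamic_rgb, np.array([[1.0]]))
```

Iterating from far to near lets each nearer layer attenuate everything behind it with a single multiply-add per layer, which is one reason layered representations of this kind are cheap to render.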