DyST: Towards Dynamic Neural Scene Representations on Real-World Videos

Maximilian Seitzer, Sjoerd van Steenkiste, Thomas Kipf, Klaus Greff, Mehdi S. M. Sajjadi

TL;DR

The Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose.

Abstract

Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.
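The Figure 1 caption names the mechanism behind this separation: a latent control swap, in which the camera estimator and the dynamics estimator each see a different view than the target. The selection of those views can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a synthetic scene is available as a grid of views indexed by camera `c` and dynamics state `d`, that the two cameras and the two dynamics states are drawn to be distinct, and the helper name `sample_control_swap` is hypothetical.

```python
import random


def sample_control_swap(num_cameras, num_dynamics, rng=random):
    """Pick (camera, dynamics) index pairs for one control-swap example.

    Hypothetical helper: a scene is assumed to be a grid of views y[c][d].
    Drawing distinct cameras and distinct dynamics states is an assumption
    of this sketch, made so that neither estimator sees the target itself.
    """
    if num_cameras < 2 or num_dynamics < 2:
        raise ValueError("need at least two cameras and two dynamics states")
    c1, c2 = rng.sample(range(num_cameras), 2)   # two distinct cameras
    d1, d2 = rng.sample(range(num_dynamics), 2)  # two distinct dynamics states
    return {
        "target": (c1, d1),          # view to be synthesized, y_{d1}^{c1}
        "camera_input": (c1, d2),    # matching camera, different dynamics
        "dynamics_input": (c2, d1),  # matching dynamics, different camera
    }
```

Because the camera estimator only ever sees a view whose dynamics differ from the target, it cannot pass dynamics information to the decoder, and the same holds for the dynamics estimator with respect to the camera.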

Paper Structure (25 sections, 9 equations, 9 figures, 1 table)


Figures (9)

  • Figure 1: The DyST model. Input views of a scene are encoded into the scene representation $\mathcal{Z}$ capturing the scene content. The model is trained with an $L_2$ loss by synthesizing novel target views from $\mathcal{Z}$. To identify the target view $y_{d_1}^{c_1}$, the camera and dynamics estimators produce the low-dimensional camera control latents $\hat{c}_1, \hat{d}_1$ from views with matching camera ($y_{d_2}^{c_1}$) and dynamics ($y_{d_1}^{c_2}$). This scheme, termed latent control swap, induces a separation of camera and scene dynamics in the latent space (see \ref{subsec:inducing-latent-structure}). We co-train DyST on synthetic multi-view scenes and real-world monocular video, transferring the latent structure and thereby enabling controlled generation on real videos.
  • Figure 2: Illustration of a DySO scene.
  • Figure 3: NVS on DySO. Left: Qualitative results. DyST is able to learn how to extract camera & dynamics independently from the respective views of the scene, leading to the correct prediction of the GT image. Right: Quantitative performance for various inputs for $\mathrm{CE}_\theta$ & $\mathrm{DE}_\theta$. PSNR is high even when camera or dynamics are only from matching views, showing that the model is capable of estimating camera & dynamics independently of the other.
  • Figure 4: Frame synthesis on SSv2. We use the first, middle, and last frame as input views (marked in purple), and generate the intermediate frames based on the control latents estimated from them. DyST is able to render videos with challenging camera (left) and object motions (right).
  • Figure 5: Latent distance analysis. Left: average L2 distances in the latent space between pairs of views on DySO, for camera (left) and dynamics control latents (right). Right: frame-to-frame L2 distances for a real-world video, for camera (left) and dynamics control latents (right). The distances closely follow events in the video: grasping (second frame), turning (third frame), and placing the cup (last frame) are visible as distinct areas in the dynamics latent distances. The slow panning camera movement is reflected in the broad diagonal stripe for the camera latent distances. Notably, the grasp has low distance to the similar placement motion despite the different camera positions, indicating that the model has learned to encode dynamics independently of the camera pose.
  • ...and 4 more figures
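
The two quantities reported in the captions above, PSNR (Figure 3) and pairwise L2 distances between control latents (Figure 5), can be computed as in the sketch below. It uses only the Python standard library and is not taken from the paper's code: the function names are hypothetical, latents are assumed to be plain sequences of floats, and images are assumed to be flattened pixel sequences with values in `[0, max_val]`.

```python
import math


def pairwise_l2(latents):
    """Return a matrix D with D[i][j] = Euclidean distance between latents i and j."""
    return [[math.dist(a, b) for b in latents] for a in latents]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    Assumes both sequences have the same length and values in [0, max_val].
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Applied to the per-frame camera or dynamics latents of one video, `pairwise_l2` yields the kind of frame-to-frame distance matrix that Figure 5 visualizes, with zeros on the diagonal and small values where two frames share a similar camera pose or motion.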