
GFlow: Recovering 4D World from Monocular Video

Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, Xinchao Wang

TL;DR

GFlow addresses the challenge of recovering a 4D dynamic world from a single monocular video without known camera parameters by modeling the scene as a flow of explicit 3D Gaussians. It derives depth, optical flow, and intrinsics priors, then alternates between per-frame camera pose optimization and Gaussian-point refinement, incorporating prior-driven initialization and pixel-wise densification to handle dynamic content. The method introduces movement-based Gaussian clustering and an isotropic loss to stabilize reconstruction in sparse-view monocular settings, achieving state-of-the-art reconstruction quality on DAVIS while enabling downstream tasks such as point tracking, video segmentation, and editing through an explicit, editable representation. Overall, GFlow offers a practical, explicit, and versatile approach to 4D reconstruction from casual monocular footage, with clear benefits for view synthesis and content editing in real-world scenes.

Abstract

Recovering a 4D world from monocular video is a crucial yet challenging task. Conventional methods usually rely on the assumptions of multi-view videos, known camera parameters, or static scenes. In this paper, we relax all these constraints and tackle a highly ambitious but practical task: given only one monocular video without camera parameters, we aim to recover the dynamic 3D world alongside the camera poses. To solve this, we introduce GFlow, a new framework that uses only 2D priors (depth and optical flow) to lift a video to a 4D scene, represented as a flow of 3D Gaussians through space and time. GFlow starts by segmenting the video into still and moving parts, then alternates between optimizing camera poses and the dynamics of the 3D Gaussian points. This method ensures consistency among adjacent points and smooth transitions between frames. Since dynamic scenes continually introduce new visual content, we present prior-driven initialization and pixel-wise densification strategies for Gaussian points to integrate new content. By combining all these techniques, GFlow transcends the boundaries of 4D recovery from casual videos; it naturally enables tracking of points and segmentation of moving objects across frames. Additionally, GFlow estimates the camera pose for each frame, enabling novel view synthesis by changing the camera pose. This capability facilitates extensive scene-level or object-level editing, highlighting GFlow's versatility and effectiveness. Visit our project page at: https://littlepure2333.github.io/GFlow
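The abstract mentions prior-driven initialization of Gaussian points from 2D priors such as depth. This page does not spell out the exact scheme, so the following is only a minimal sketch of the standard pinhole back-projection that turns a depth map and intrinsics into 3D points in camera space. The function name `unproject_depth`, the `stride` parameter, and the intrinsics values `fx`, `fy`, `cx`, `cy` are illustrative assumptions, not names from the paper.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    stride: int = 1,
) -> List[Point3D]:
    """Back-project a depth map into camera-space 3D points (pinhole model).

    Assumption: depth[v][u] is the z-distance along the optical axis, and
    (fx, fy, cx, cy) are the entries of the intrinsic matrix K.
    """
    points: List[Point3D] = []
    for v in range(0, len(depth), stride):
        for u in range(0, len(depth[v]), stride):
            z = depth[v][u]
            if z <= 0:
                # Skip pixels with missing or invalid depth.
                continue
            # Invert the projection u = fx * x / z + cx, v = fy * y / z + cy.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Each returned point could serve as the initial center of one Gaussian; a larger `stride` would give a sparser initialization.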

Paper Structure

This paper contains 46 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: A) Given a monocular video in the wild, B) our proposed GFlow can reconstruct the underlying 4D world, i.e. the dynamic scene represented by 3D Gaussian splatting [kerbl20233d] and the associated camera poses. Within GFlow, the Gaussians are split into still and moving clusters and are further densified. C) GFlow facilitates a range of applications, including tracking objects in 2D and 3D, segmenting video objects, synthesizing new views, estimating consistent depth, and video editing. We encourage readers to visit the anonymous website for more video illustrations.
  • Figure 2: Overview of GFlow. A) Given a monocular video input consisting of an image sequence $\{I_t\}$, the associated depth $\{D_t\}$, optical flow $\{F_t\}$, and camera intrinsics $K$ are obtained using off-the-shelf priors. B) For each frame, GFlow first clusters the scene into a still part $\{G_t^{s}\}$ and a moving part $\{G_t^{m}\}$. The optimization process in GFlow then consists of two steps: C1) Only the camera pose $P_t$ is optimized, by aligning the appearance, depth, and optical flow within the still cluster. C2) Under the optimized camera pose $P_t^*$, the Gaussian points $\{G_t\}$ are optimized and densified based on appearance, depth, optical flow, and the two scene clusters. D) The same procedure of steps B, C1, and C2 loops for the next frame. The colorful marks under the dashed line represent the variables involved in the optimization.
  • Figure 3: GFlow can explicitly model the dynamic 3D scene in the video. Here we show some rendered examples of videos from the DAVIS [perazzi2016benchmark, pont20172017] dataset in the 3D world space.
  • Figure 4: Visual comparison of reconstruction quality on the DAVIS [perazzi2016benchmark, pont20172017] dataset: CoDeF [ouyang2023codef], RoDynRF [liu2023robust], 4DGS [yang2023gs4d], Deformable Sprites [ye2022sprites], and ours.
  • Figure 5: Point tracking visualization on the DAVIS dataset. A) Tracking in the 2D camera view, which contains the joint motion of camera and content. B) Tracking in 3D world coordinates, which shows only content movement.
  • ...and 2 more figures
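
The per-frame procedure described in the Figure 2 caption (steps B, C1, C2, and D) can be read as a simple alternating loop. The sketch below shows only that control flow; it is not the authors' implementation. The function name `reconstruct_sequence` and all callables passed into it are hypothetical placeholders standing in for the clustering, pose optimization, and Gaussian optimization steps, whose details are not given on this page.

```python
from typing import Any, Callable, List, Sequence, Tuple


def reconstruct_sequence(
    frames: Sequence[Any],
    init_gaussians: Callable[[Any], Any],
    cluster: Callable[[Any, Any], Tuple[Any, Any]],
    optimize_pose: Callable[[Any, Any, Any], Any],
    optimize_gaussians: Callable[[Any, Any, Any, Any, Any], Any],
    initial_pose: Any,
) -> Tuple[Any, List[Any]]:
    """Alternate pose and Gaussian optimization frame by frame.

    All callables are hypothetical stand-ins; only the ordering of the
    steps follows the Figure 2 caption.
    """
    if not frames:
        raise ValueError("need at least one frame")

    gaussians = init_gaussians(frames[0])  # assumed: initialized from the first frame
    pose = initial_pose
    poses: List[Any] = []

    for frame in frames:  # D) repeat B, C1, C2 for every frame
        # B) split the scene into still and moving clusters
        still, moving = cluster(gaussians, frame)
        # C1) optimize only the camera pose, using the still cluster
        pose = optimize_pose(pose, still, frame)
        # C2) with the pose fixed, optimize and densify the Gaussians
        gaussians = optimize_gaussians(gaussians, pose, still, moving, frame)
        poses.append(pose)

    return gaussians, poses
```

Passing the steps in as callables keeps the sketch independent of any particular renderer or optimizer; warm-starting each frame's pose from the previous frame's result is an assumption of this sketch.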