Seeing World Dynamics in a Nutshell

Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, Xinchao Wang

TL;DR

NutWorld introduces a feed-forward framework that represents casually captured monocular videos as Spatial-Temporal Aligned Gaussian (STAG) primitives in a canonical orthographic space, enabling coherent 3D dynamics without per-scene optimization. A transformer-based encoder–decoder maps unposed frames to STAGs, while calibrated depth and optical-flow priors regularize depth and motion to resolve monocular ambiguities. The method supports segment-based long-video inference and delivers real-time reconstruction with strong performance on downstream tasks such as depth estimation, segmentation, frame interpolation, and novel-view synthesis. Experiments on MiraData and RealEstate10K show high reconstruction fidelity, substantial speedups over optimization-based methods, and broad applicability across video processing tasks, highlighting NutWorld as a versatile, general-purpose video representation framework.

Abstract

We consider the problem of efficiently representing casually captured monocular videos in a spatially and temporally coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques that treat videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to the absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real time. Demos and code will be available at https://github.com/Nut-World/NutWorld.

Paper Structure

This paper contains 18 sections, 13 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: The illustration of STAG to represent dynamic scenes.
  • Figure 2: Overview of NutWorld. We directly predict STAG in a canonical space from sparse input frames via a transformer-based reconstruction model, where calibrated depth and flow priors are leveraged to avoid depth ambiguity and motion uncertainty.
  • Figure 3: Qualitative comparison of video reconstruction using our NutWorld and other optimization-based methods.
  • Figure 4: Qualitative results in various downstream tasks, including video segmentation, editing, frame interpolation and consistent depth estimation. More visualization results for each task are presented in the Appendix.
  • Figure 5: Visualization of Gaussian trajectories. Trajectories of selected Gaussian centers are illustrated as point tracks.
  • ...and 7 more figures