WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains

Qisen Wang, Yifan Zhao, Jia Li

TL;DR

WorldTree tackles monocular dynamic reconstruction by decomposing spatiotemporal dynamics with a Temporal Partition Tree (TPT) for coarse-to-fine temporal refinement and Spatial Ancestral Chains (SAC) for multi-scale spatial context. The framework lifts 2D priors to a deformable SE(3) motion-basis representation and models content with dynamic Gaussians that are blended and splatted, enabling differentiable rendering. Through parallel optimization and specialized losses, WorldTree achieves state-of-the-art results on NVIDIA-LS and DyCheck, notably improving LPIPS and mLPIPS in dynamic regions and demonstrating robust generalization to wild video data. The approach offers a practical path toward realistic 4D dynamic worlds from monocular video, with potential impact on AR/VR and video-heavy applications, while future work could further improve priors and reduce reliance on pre-trained models.

Abstract

Dynamic reconstruction has made remarkable progress, but monocular input, the setting most relevant to practical applications, remains challenging. Prevailing works attempt to construct efficient motion representations but lack a unified spatiotemporal decomposition framework, and thus suffer from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising a Temporal Partition Tree (TPT), which enables coarse-to-fine optimization through an inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our method achieves an 8.26% improvement in LPIPS on NVIDIA-LS and a 9.09% improvement in mLPIPS on DyCheck compared to the second-best method. Code: https://github.com/iCVTEAM/WorldTree.
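
To make the tree-and-chain idea concrete, here is a minimal sketch of a temporal partition tree with an ancestral-chain query. It is an illustration under assumptions, not the authors' implementation: the binary midpoint split, the breadth-first coarse-to-fine order, and the names `TemporalNode`, `build_tree`, `leaf_for_frame`, `ancestral_chain`, and `coarse_to_fine_order` are all hypothetical, and the motion representation stored at each node is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids parent/child recursion
class TemporalNode:
    """Hypothetical tree node covering the frame interval [start, end)."""
    start: int
    end: int
    depth: int = 0
    parent: Optional["TemporalNode"] = field(default=None, repr=False)
    children: List["TemporalNode"] = field(default_factory=list, repr=False)

    def covers(self, frame: int) -> bool:
        return self.start <= frame < self.end


def build_tree(start: int, end: int, max_depth: int,
               parent: Optional[TemporalNode] = None,
               depth: int = 0) -> TemporalNode:
    """Recursively halve the interval (assumed split rule) down to max_depth."""
    node = TemporalNode(start, end, depth, parent)
    if depth < max_depth and end - start > 1:
        mid = (start + end) // 2
        node.children = [
            build_tree(start, mid, max_depth, node, depth + 1),
            build_tree(mid, end, max_depth, node, depth + 1),
        ]
    return node


def leaf_for_frame(root: TemporalNode, frame: int) -> TemporalNode:
    """Descend to the finest sub-interval that contains the frame."""
    node = root
    while node.children:
        node = next(c for c in node.children if c.covers(frame))
    return node


def ancestral_chain(node: TemporalNode) -> List[TemporalNode]:
    """Collect the node and all of its ancestors, ordered root first."""
    chain = []
    current: Optional[TemporalNode] = node
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain[::-1]


def coarse_to_fine_order(root: TemporalNode) -> List[TemporalNode]:
    """Breadth-first visit order: whole video first, sub-intervals later."""
    order, frontier = [], [root]
    while frontier:
        order.extend(frontier)
        frontier = [c for n in frontier for c in n.children]
    return order


root = build_tree(0, 8, max_depth=2)
chain = ancestral_chain(leaf_for_frame(root, 5))
intervals = [(n.start, n.end) for n in chain]
```

In this sketch, a query for frame 5 walks from the whole-video root down to the sub-interval that contains it, so a renderer could combine coarse motion from ancestors with finer motion from the leaf. How WorldTree actually inherits parameters between nodes and blends their dynamics is not specified in this summary.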

Paper Structure

This paper contains 47 sections, 8 equations, 13 figures, 9 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Intuitive illustration of WorldTree.
  • Figure 2: WorldTree Pipeline. Our proposed method starts by extracting 2D prior results and then initializes the dynamic representation of the tree root. Furthermore, WorldTree builds TPT to achieve temporal coarse-to-fine optimization from the overall interval to the sub-interval of the video, and utilizes SAC to achieve the complementary spatial dynamic representation at the same time, thereby achieving high-quality dynamic reconstruction.
  • Figure 3: Qualitative comparisons with other methods on NVIDIA-LS.
  • Figure 4: Qualitative comparisons with other methods on DyCheck.
  • Figure 5: Quantitative and qualitative ablations on the NVIDIA-LS dataset.
  • ...and 8 more figures