HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation

Yiming Liang, Tianhan Xu, Yuta Kikuchi

TL;DR

HiMoR addresses monocular dynamic 3D reconstruction by introducing a hierarchical motion representation that decomposes scene motion into coarse and fine components using a tree of Gaussian primitives. The relative SE(3) motion of each node is expressed through shared motion bases, enabling stable, low-rank motion modeling and detailed deformation via leaf-node interpolation, with node densification ensuring coverage of occluded regions. The method combines a rigorous loss design and perceptual metrics, achieving state-of-the-art results on challenging monocular videos and demonstrating improved temporal consistency and novel-view synthesis. The work contributes a scalable, structured deformation framework for Gaussians, emphasizing perceptual evaluation and offering practical insights for dynamic scene capture with monocular inputs.

Abstract

We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives capable of achieving high-quality monocular dynamic 3D reconstruction. The insight behind HiMoR is that motions in everyday scenes can be decomposed into coarser motions that serve as the foundation for finer details. Using a tree structure, HiMoR's nodes represent different levels of motion detail, with shallower nodes modeling coarse motion for temporal smoothness and deeper nodes capturing finer motion. Additionally, our model uses a few shared motion bases to represent motions of different sets of nodes, aligning with the assumption that motion tends to be smooth and simple. This motion representation design provides Gaussians with a more structured deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction. We also propose using a more reliable perceptual metric as an alternative, given that pixel-level metrics for evaluating monocular dynamic 3D reconstruction can sometimes fail to accurately reflect the true quality of reconstruction. Extensive experiments demonstrate our method's efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.

Paper Structure

This paper contains 25 sections, 9 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview. Left: The proposed hierarchical motion representation (HiMoR) is defined in the canonical frame with 3D Gaussian primitives. HiMoR uses a tree structure where each node represents the relative motion to its parent node, with the root node representing stationary motion that is fixed to the world coordinate origin. Top right: Child nodes that belong to the same parent node share a set of $\mathbb{SE}(3)$ motion bases, and the motion of each child node is obtained by weighting the motion bases with its own coefficients. The motion of leaf nodes relative to the world coordinate is iteratively computed based on the hierarchy of HiMoR. Bottom right: The deformation of each Gaussian is derived by weighting the motion of its K-nearest neighbor (KNN) leaf nodes within the canonical frame.
  • Figure 2: Visualizations of reference images and rendered results. Both SoM and our method show some misalignment with respect to the ground truth. SoM's result is noticeably broken, yet it achieves a higher PSNR owing to the transparency of the hand; ours preserves the integrity of both geometry and appearance but has a lower PSNR. We found that evaluating reconstruction quality with a perceptual metric (i.e., CLIP-I [animate1242023]) aligns better with human perception.
  • Figure 3: Qualitative results of novel view synthesis on the iPhone dataset [dycheck2022]. From the top: "Apple", "Block", "Paper-windmill", and "Teddy".
  • Figure 4: Qualitative comparison of temporal consistency at a novel view on the iPhone dataset [dycheck2022]. The time interval between adjacent images is ten frames.
  • Figure 5: Motion decomposition via hierarchical structure. The sequences are rendered at a fixed camera. We extract (a) coarse motion by freezing second-level nodes and (b) fine motion by freezing first-level nodes. It can be observed that (a) models general movements of the arm and the backpack, whereas (b) captures subtle rotations of the backpack and the swing of the straps.
  • ...and 8 more figures
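
The hierarchy described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation. It evaluates a single time step and makes several assumptions that this page does not specify: each motion basis is stored as an axis-angle vector plus a translation, bases are blended by a plain weighted sum of those parameters, a node's world motion is its parent's world motion multiplied by its relative motion, and a point is deformed by blending the positions produced by its K nearest leaf nodes with normalized Gaussian weights. All function and variable names (`blend_bases`, `world_motion`, `deform_point`, `sibling_bases`, and so on) are hypothetical.

```python
import math

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def se3_from_params(rotvec, trans):
    """Build a 4x4 rigid transform from an axis-angle vector and a translation."""
    theta = math.sqrt(sum(c * c for c in rotvec))
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    if theta < 1e-12:
        R = eye
    else:
        # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        x, y, z = (c / theta for c in rotvec)
        K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
        K2 = matmul(K, K)
        s, c = math.sin(theta), 1.0 - math.cos(theta)
        R = [[eye[i][j] + s * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    return [R[i] + [float(trans[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def blend_bases(bases, coeffs):
    """Weight the shared bases by one node's coefficients (assumed: linear blend)."""
    rotvec = [sum(w * b[0][k] for w, b in zip(coeffs, bases)) for k in range(3)]
    trans = [sum(w * b[1][k] for w, b in zip(coeffs, bases)) for k in range(3)]
    return se3_from_params(rotvec, trans)

def world_motion(node, parent, sibling_bases, coeffs):
    """Compose relative motions from the stationary root down to `node`."""
    if parent[node] is None:  # the root is fixed to the world origin
        return IDENTITY
    relative = blend_bases(sibling_bases[parent[node]], coeffs[node])
    above = world_motion(parent[node], parent, sibling_bases, coeffs)
    return matmul(above, relative)  # assumed order: parent motion, then relative

def deform_point(p, leaf_pos, leaf_T, k=2):
    """Blend the positions given by the k nearest leaf nodes (canonical frame)."""
    nearest = sorted((math.dist(p, q), i) for i, q in enumerate(leaf_pos))[:k]
    raw = [math.exp(-d * d) for d, _ in nearest]  # assumed Gaussian weighting
    total = sum(raw)
    homog = list(p) + [1.0]
    out = [0.0, 0.0, 0.0]
    for w, (_, i) in zip(raw, nearest):
        moved_i = [sum(t * v for t, v in zip(leaf_T[i][r], homog)) for r in range(3)]
        out = [o + (w / total) * m for o, m in zip(out, moved_i)]
    return out

# Toy tree: root 0 -> node 1 -> leaves 2 and 3.
parent = {0: None, 1: 0, 2: 1, 3: 1}
# Bases are keyed by the parent whose children share them.
sibling_bases = {
    0: [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))],            # coarse: shift along x
    1: [((0.0, 0.0, math.pi / 2), (0.0, 0.0, 0.0)),     # fine: quarter turn about z
        ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))],            # fine: shift along z
}
coeffs = {1: [1.0], 2: [1.0, 0.0], 3: [0.0, 1.0]}

leaf_T = [world_motion(n, parent, sibling_bases, coeffs) for n in (2, 3)]
leaf_pos = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]  # canonical positions of leaves 2, 3
moved = deform_point([1.0, 0.0, 0.0], leaf_pos, leaf_T, k=1)
```

In this toy setup, setting a leaf's coefficients to zero leaves only the coarse shift inherited from node 1, which loosely mirrors the coarse/fine separation that the Figure 5 caption obtains by freezing one level of nodes.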