RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos

Yuxin Yao, Zhi Deng, Junhui Hou

TL;DR

RigGS models articulated objects from monocular videos by representing the object as a canonical set of 3D Gaussians $\mathcal{G}=\{G_i:\mu_i,\mathbf{q}_i,\mathbf{s}_i,\sigma_i,sh_i\}$ that a skeleton-aware deformation field deforms over time. A coarse-to-fine procedure then reduces dense candidate nodes to a sparse skeleton: mean trajectories are used to select a canonical shape, a minimum spanning tree is built over the dense candidates, and the tree is pruned to a sparse, semantically informed skeleton whose symmetry is refined with DINOv2 features. The skeleton is bound to $\mathcal{G}$ through learnable skinning weights and a pose-dependent detail module, trained with render losses $L_{\text{render}}^t$, projection losses $L_{\tilde{\text{proj}}}^t$, and regularizers $L_{\text{arap}}^t$ and $L_{\text{id}}^t$, which enables editing and new-motion generation. Experiments on synthetic datasets (D-NeRF, DG-Mesh) and real MoCap data (ZJU-MoCap) show competitive rendering quality together with pose editing, interpolation, and motion transfer, though the approach depends on good-quality 2D skeletons and is subject to camera motion constraints.
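The minimum-spanning-tree step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain Euclidean distance between candidate node positions as the edge weight, whereas the summary says motion and semantic information also inform the final skeleton. The function name `minimum_spanning_tree` and the toy `candidates` are hypothetical, chosen only for illustration.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph over `points`.

    Edge weight is the Euclidean distance between two candidate nodes
    (an assumption made for this sketch). Returns (parent, child) index
    pairs in the order the nodes are attached to the tree.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_parent = [-1] * n       # tree node that provides that connection
    best_dist[0] = 0.0           # node 0 is used as the root
    edges = []
    for _ in range(n):
        # Attach the not-yet-connected node that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_parent[u], u))
        # Relax the connection cost of the remaining nodes through u.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return edges


# Toy candidate nodes: three along the x-axis plus one branch above the middle.
candidates = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
tree_edges = minimum_spanning_tree(candidates)
tree_length = sum(math.dist(candidates[a], candidates[b]) for a, b in tree_edges)
```

In this toy case the middle node becomes a branching joint connected to its three neighbours. A pruning pass, which the summary describes but this sketch omits, would then merge or remove nodes to obtain the sparse skeleton.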

Abstract

This paper considers the problem of modeling articulated objects captured in 2D videos to enable novel view synthesis, while also being easily editable, drivable, and re-posable. To tackle this challenging problem, we propose RigGS, a new paradigm that leverages 3D Gaussian representation and skeleton-based motion representation to model dynamic objects without utilizing additional template priors. Specifically, we first propose skeleton-aware node-controlled deformation, which deforms a canonical 3D Gaussian representation over time to initialize the modeling process, producing candidate skeleton nodes that are further simplified into a sparse 3D skeleton according to their motion and semantic information. Subsequently, based on the resulting skeleton, we design learnable skin deformations and pose-dependent detailed deformations, thereby easily deforming the 3D Gaussian representation to generate new actions and render further high-quality images from novel views. Extensive experiments demonstrate that our method can generate realistic new actions easily for objects and achieve high-quality rendering.
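To make the idea of skinning-weight-driven deformation concrete, the sketch below applies standard linear blend skinning to a single Gaussian center. It is an assumption that this textbook formulation resembles what the paper uses; the abstract only states that the skin deformations are learnable and complemented by pose-dependent detail deformations, which are omitted here. All names (`rotation_z`, `transform_point`, `blend_skinning`) and the two-bone setup are hypothetical.

```python
import math


def rotation_z(angle):
    """3x3 rotation matrix about the z-axis, as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform_point(rotation, pivot, translation, point):
    """Rotate `point` about `pivot`, then add `translation`."""
    local = [point[k] - pivot[k] for k in range(3)]
    rotated = [sum(rotation[r][k] * local[k] for k in range(3)) for r in range(3)]
    return [rotated[r] + pivot[r] + translation[r] for r in range(3)]


def blend_skinning(point, weights, bones):
    """Linear blend skinning: weighted sum of the per-bone transformed points.

    `weights` are the skinning weights of this point (expected to sum to 1);
    `bones` is a list of (rotation, pivot, translation) tuples.
    """
    out = [0.0, 0.0, 0.0]
    for w, (rotation, pivot, translation) in zip(weights, bones):
        moved = transform_point(rotation, pivot, translation, point)
        for k in range(3):
            out[k] += w * moved[k]
    return out


# Two bones: a static root and a child rotated 90 degrees about a joint at x = 1.
bones = [
    (rotation_z(0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
    (rotation_z(math.pi / 2), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
]
center = (2.0, 0.0, 0.0)  # canonical position of one Gaussian center

fully_child = blend_skinning(center, [0.0, 1.0], bones)  # follows the child bone
half_half = blend_skinning(center, [0.5, 0.5], bones)    # blended between bones
```

A center bound entirely to the child bone swings around the joint, while one with mixed weights lands between the two bone-transformed positions. In a learned setting the weights would be optimized rather than set by hand, and the Gaussian rotations would need to be updated as well, which this sketch does not cover.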

Paper Structure

This paper contains 16 sections, 18 equations, 18 figures, 6 tables, 2 algorithms.

Figures (18)

  • Figure 1: RigGS is a new and effective paradigm for automatically modeling articulated objects from 2D videos without any template prior. RigGS allows for easy editing and interpolation of object motion while supporting high-quality real-time rendering for these creative poses. We visualize the constructed skeleton (top-left), skinning weights (top-right), and edited new poses (bottom) for each object.
  • Figure 2: Overview of our RigGS. Initially, we construct a canonical 3D Gaussian, coupled with skeleton-aware node-controlled deformation, to begin the 4D reconstruction of the dynamic object. From the resulting skeleton-aware nodes, we then extract a sparse skeleton using a heuristic algorithm. Finally, leveraging the initialized deformation field and 3D Gaussians as starting values, we design learnable skinning weights and optimize a skeleton-driven deformation field. Our RigGS can be utilized for tasks such as editing, interpolation, and motion transfer, enabling real-time high-quality rendering of these new actions.
  • Figure 3: The process of the skeleton construction. The red circles mark the nodes that need to be removed or merged, and the yellow circles mark some locations selected as skeleton joints.
  • Figure 4: Novel view rendering and repose results via anisotropic or isotropic 3D Gaussians.
  • Figure 5: Comparison of visual results for different variants of the 2D projection loss term.
  • ...and 13 more figures