RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos

Yuxin Yao; Zhi Deng; Junhui Hou

TL;DR

RigGS models articulated objects from monocular videos by representing the object as a canonical 3D Gaussian set $\mathcal{G}=\{G_i:\mu_i,\mathbf{q}_i,\mathbf{s}_i,\sigma_i,sh_i\}$ that a skeleton-aware field deforms over time. The skeleton is built coarse to fine: mean trajectories are used to select a canonical shape, a minimum spanning tree is built from dense candidate nodes, and the tree is pruned to a sparse, semantically informed skeleton with symmetry refinement via DINOv2 features. The skeleton is then bound to $\mathcal{G}$ with learnable skinning weights and a pose-dependent detail module, trained with render losses $L_{\text{render}}^t$, projection losses $L_{\tilde{\text{proj}}}^t$, and regularizers $L_{\text{arap}}^t$ and $L_{\text{id}}^t$, which enables editing and new-motion generation. Experiments on synthetic datasets (D-NeRF, DG-Mesh) and real MoCap data (ZJU-MoCap) show competitive rendering quality while enabling pose editing, interpolation, and motion transfer, though the approach relies on good-quality 2D skeletons and is subject to camera motion constraints.
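To make the minimum spanning tree step mentioned in the TL;DR concrete, the following is a minimal sketch of Prim's algorithm over a set of candidate node positions. It is an illustration only: the function name `minimum_spanning_tree`, the choice of node 0 as the root, and the use of plain Euclidean distance as the edge cost are assumptions made here, and the edge costs used in the paper may differ.

```python
import numpy as np

def minimum_spanning_tree(nodes):
    """Prim's algorithm on the complete graph over candidate nodes.

    nodes: (N, 3) array of candidate node positions (hypothetical input).
    Edge cost is Euclidean distance, which is an assumption of this sketch.
    Returns a list of (parent, child) index pairs, rooted at node 0.
    """
    nodes = np.asarray(nodes, dtype=float)
    n = len(nodes)
    # Pairwise Euclidean distances, shape (N, N).
    dist = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_cost = dist[0].copy()            # cheapest link from each node to the tree
    best_parent = np.zeros(n, dtype=int)  # tree node that realizes that link
    edges = []
    for _ in range(n - 1):
        # Pick the cheapest node that is not yet in the tree.
        cost = np.where(in_tree, np.inf, best_cost)
        child = int(np.argmin(cost))
        edges.append((int(best_parent[child]), child))
        in_tree[child] = True
        # The newly added node may offer cheaper links to the remaining nodes.
        closer = dist[child] < best_cost
        best_cost = np.where(closer, dist[child], best_cost)
        best_parent = np.where(closer, child, best_parent)
    return edges

# Four collinear candidates given out of order; the tree chains neighbours.
tree_edges = minimum_spanning_tree([[0, 0, 0], [3, 0, 0], [1, 0, 0], [2, 0, 0]])
```

A tree over N candidates always has N - 1 edges, so a later pruning stage that merges or drops nodes can operate on this parent-child list directly.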

Abstract

This paper considers the problem of modeling articulated objects captured in 2D videos so that the resulting models support novel view synthesis while remaining easily editable, drivable, and re-posable. To tackle this challenging problem, we propose RigGS, a new paradigm that leverages a 3D Gaussian representation and a skeleton-based motion representation to model dynamic objects without additional template priors. Specifically, we first propose skeleton-aware node-controlled deformation, which deforms a canonical 3D Gaussian representation over time to initialize the modeling process and produces candidate skeleton nodes that are further simplified into a sparse 3D skeleton according to their motion and semantic information. Subsequently, based on the resulting skeleton, we design learnable skin deformations and pose-dependent detailed deformations, which make it easy to deform the 3D Gaussian representation to generate new actions and render high-quality images from novel views. Extensive experiments demonstrate that our method can easily generate realistic new actions for objects and achieve high-quality rendering.
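As a rough illustration of what skeleton-driven skin deformation of Gaussian centers can look like, the sketch below applies standard linear blend skinning with softmax-normalized weights. This is an assumption made for illustration rather than the formulation used in the paper: the names `skin_centers` and `weight_logits` are hypothetical, only the centers $\mu_i$ are transformed, and the pose-dependent detail deformation is omitted.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def skin_centers(mu, weight_logits, rotations, translations):
    """Linear blend skinning of Gaussian centers (illustrative sketch).

    mu:            (N, 3) canonical Gaussian centers
    weight_logits: (N, B) unnormalized skinning weights (learnable in practice)
    rotations:     (B, 3, 3) per-bone rotation matrices
    translations:  (B, 3) per-bone translations
    Returns the (N, 3) deformed centers.
    """
    w = softmax(weight_logits, axis=1)                       # rows sum to 1
    # Transform every center by every bone: result has shape (N, B, 3).
    per_bone = np.einsum("bij,nj->nbi", rotations, mu) + translations[None]
    # Blend the per-bone results with the skinning weights.
    return np.einsum("nb,nbi->ni", w, per_bone)

# Two bones with identity pose leave the centers unchanged.
mu = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, -1.0]])
identity_pose = np.stack([np.eye(3), np.eye(3)])
deformed = skin_centers(mu, np.zeros((2, 2)), identity_pose, np.zeros((2, 3)))
```

Normalizing the weights with a softmax keeps each center a convex combination of its per-bone transforms, which is one common way to keep learnable weights well behaved during optimization.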
