SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization

Lifan Wu, Ruijie Zhu, Yubo Ai, Tianzhu Zhang

TL;DR

SkeletonGaussian tackles editable 4D generation from monocular video by introducing a skeleton-driven, hierarchical deformation model for Gaussian splatting. The method splits motion into rigid, skeleton-driven deformation via linear blend skinning (LBS) and fine non-rigid refinement with a HexPlane plus an MLP, which enables direct pose-based editing and improves interpretability. It first generates a static 3D Gaussian object, then applies the rigid and non-rigid stages, all trained with MV-SDS, reconstruction, and mask losses, and reports higher quality than prior methods on Consistent4D. The approach integrates with animation pipelines and supports real-time editing, offering a practical pathway for controllable 4D motion synthesis.

Abstract

4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: https://wusar.github.io/projects/skeletongaussian/
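The rigid stage mentioned in the abstract relies on linear blend skinning. As a rough illustration of that general technique (not the authors' implementation), the NumPy sketch below blends per-bone rigid transforms with skinning weights. The function name `linear_blend_skinning`, the array layout, and the toy two-bone setup are all assumptions made for this example, and it deforms only point positions such as Gaussian centers, not Gaussian rotations or covariances.

```python
import numpy as np


def linear_blend_skinning(points, weights, rotations, translations):
    """Deform canonical points with per-bone rigid transforms (hypothetical helper).

    points:       (N, 3) canonical positions (e.g. Gaussian centers)
    weights:      (N, B) skinning weights, each row summing to 1
    rotations:    (B, 3, 3) per-bone rotation matrices
    translations: (B, 3) per-bone translations
    returns:      (N, 3) deformed positions
    """
    # Apply every bone transform to every point: result has shape (B, N, 3).
    per_bone = np.einsum("bij,nj->bni", rotations, points) + translations[:, None, :]
    # Blend the B candidate positions of each point with its skinning weights.
    return np.einsum("nb,bni->ni", weights, per_bone)


# Toy example: bone 0 stays fixed, bone 1 rotates 90 degrees about the z-axis.
rot_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotations = np.stack([np.eye(3), rot_z90])
translations = np.zeros((2, 3))
points = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
deformed = linear_blend_skinning(points, weights, rotations, translations)
```

In this toy setup, a point bound entirely to one bone follows that bone exactly, while the point with equal weights lands halfway between the two transformed positions. Editing a pose then amounts to changing `rotations` and `translations` and calling the function again.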

Paper Structure (19 sections, 15 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Given (a) an input monocular video, we propose a novel 4D generation method SkeletonGaussian which uses (b) a skeleton to drive the motion of 4D Gaussian model. SkeletonGaussian enables (c) direct motion editing through the skeleton's explicit motion representation, allowing users to adjust skeleton poses to modify the motion of the objects directly.
  • Figure 2: Pipeline of the SkeletonGaussian framework for 4D object generation, divided into three stages. (1) Static 3D Object Generation and Skeleton Extraction: starting from a frame at the video's midpoint, a static 3D Gaussian model $\mathcal{G}_c$ is generated in canonical space, from which an inherent skeletal structure is subsequently extracted. (2) Rigid Motion Modeling: using LBS, rigid deformations $\mathcal{F}_{lbs}$ under various poses $\theta_t$ are applied to rigidly deform $\mathcal{G}_c$ into $\mathcal{G}_r$. During this stage, the skeleton poses $\theta_t$ are optimized. (3) Non-Rigid Motion Modeling: to capture fine-grained deformations, a deformation field $\mathcal{F}_{nr}$ refines the motion of the rigidly deformed 3D Gaussian $\mathcal{G}_r$, transforming it into the observation-space Gaussian $\mathcal{G}_o$. $\mathcal{F}_{nr}$ comprises a HexPlane [cao2023hexplane] and an MLP. All three stages share the same training objectives. A differentiable Gaussian rasterizer renders images of the observation-space 3D Gaussian $\mathcal{G}_o$ from multiple viewpoints and compares them to the reference video with photometric and MV-SDS losses for backpropagation.
  • Figure 3: Visualizing 4D Object Motion with Skeleton Poses. We present generated 4D object motion and its corresponding skeleton poses, where the viewpoint rotates from left to right, and time progresses linearly from left to right.
  • Figure 4: Editing Generated Motion. We visualize the generated motion (top) and edited motion sequence (bottom). Users can directly adjust the skeleton poses of specific joints at different times to edit the object's motion.
  • Figure 5: Qualitative Comparisons. We compare our method with STAG4D [zeng2024stag4d] and DreamGaussian4D [ren2023dreamgaussian4d]. For each instance, we render two viewpoints at two time steps. We also visualize the skeleton poses of SkeletonGaussian.
  • ...and 6 more figures
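
Read as function composition, the rigid and non-rigid stages described in the Figure 2 caption can be summarized as below. This is a notational sketch only: the caption does not spell out the arguments of $\mathcal{F}_{lbs}$ and $\mathcal{F}_{nr}$, so passing the time $t$ to $\mathcal{F}_{nr}$ is an assumption.

$$
\mathcal{G}_r = \mathcal{F}_{lbs}(\mathcal{G}_c, \theta_t), \qquad \mathcal{G}_o = \mathcal{F}_{nr}(\mathcal{G}_r, t)
$$

Under this reading, editing the motion as in Figure 4 corresponds to replacing the pose $\theta_t$ with a user-adjusted pose and re-evaluating the same chain.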