Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, Varun Jampani

TL;DR

SP4D tackles the challenge of generating kinematic part decompositions that are consistent across time and views, together with RGB videos, from monocular inputs. It proposes a dual-branch diffusion framework with Bidirectional Diffusion Fusion (BiDiFuse), a spatial color encoding scheme that lets both branches share one VAE encoder, and a contrastive part consistency loss to align part representations across views and time. A lightweight 2D-to-kinematic mesh pipeline lifts 2D part maps to 3D and computes harmonic skinning weights, supported by the KinematicParts20K dataset of rigged objects. The approach generalizes to real-world data, novel objects, and rare poses, enabling animation-ready assets with minimal manual intervention. This work advances 4D generative modeling by explicitly modeling kinematic structure, with implications for animation, AR/VR, and robotics.

Abstract

We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
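
The spatial color encoding described in the abstract can be illustrated with a small sketch. The abstract only states that part masks are mapped to continuous RGB-like images and recovered by post-processing. The specific color rule (normalized 2D centroid of each part, with a fixed blue channel) and the nearest-color decoding below are assumptions made for illustration, not the authors' scheme. The function names `encode_parts` and `decode_parts` are hypothetical.

```python
import numpy as np


def encode_parts(labels: np.ndarray):
    """Turn an integer part map (H, W), 0 = background, into an RGB-like image in [0, 1].

    Assumed color rule (illustrative only): each part is painted with
    (normalized centroid x, normalized centroid y, 1.0).
    """
    h, w = labels.shape
    image = np.zeros((h, w, 3), dtype=np.float32)
    palette = {}
    for part_id in np.unique(labels):
        if part_id == 0:
            continue  # background stays black
        mask = labels == part_id
        ys, xs = np.nonzero(mask)
        color = np.array(
            [xs.mean() / max(w - 1, 1), ys.mean() / max(h - 1, 1), 1.0],
            dtype=np.float32,
        )
        image[mask] = color
        palette[int(part_id)] = color
    return image, palette


def decode_parts(image: np.ndarray, palette: dict, background_threshold: float = 0.5):
    """Recover a part map by assigning each pixel to its nearest palette color.

    Pixels whose blue channel falls below the threshold are treated as background.
    """
    h, w, _ = image.shape
    if not palette:
        return np.zeros((h, w), dtype=np.int64)
    ids = np.array(list(palette.keys()), dtype=np.int64)
    colors = np.stack([palette[i] for i in ids])  # (K, 3)
    # Distance from every pixel to every palette color: (H, W, K)
    dists = np.linalg.norm(image[:, :, None, :] - colors[None, None, :, :], axis=-1)
    labels = ids[dists.argmin(axis=-1)]
    labels[image[..., 2] < background_threshold] = 0
    return labels
```

Because the encoding does not fix the number of colors, the same three-channel image format can represent any part count, which is what lets a segmentation branch reuse an RGB autoencoder. In practice the palette is not known at inference time, so decoding would need a clustering step rather than the known-palette lookup assumed here.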

Paper Structure

This paper contains 28 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Left: Stable Part Diffusion 4D (SP4D) takes a monocular input video and generates novel-view RGB videos (bottom-left) as well as consistent part segmentation videos across all views. Right: SP4D also supports single image input and synthesizes multi-view RGB images and corresponding part decompositions. These results can be lifted to 3D to produce riggable meshes with part-aware geometry and articulated structure.
  • Figure 2: Limitations of traditional 2D and 3D part decomposition methods. Left: Appearance-based 2D segmentation methods like SAM2 fail to produce kinematic parts. Middle: SOTA 3D rigging methods (Song et al., 2025) lack the capability to infer kinematic part structures from appearance and generalize poorly to diverse shapes. Right: Existing 3D part segmentation models (Tang et al., 2024; Yang et al., 2024) focus on semantic regions and are not suited for kinematic decomposition.
  • Figure 3: Stable Part Diffusion 4D model architecture. Our model builds upon SV4D 2.0 and extends it with a parallel part segmentation branch and a BiDiFuse module that enables bidirectional feature exchange between RGB and part branches. The network jointly generates multi-view videos for appearance and kinematics-aware part segmentation. Key components include: (1) spatial color encoding for part masks, enabling shared VAE encoder/decoder; (2) BiDiFuse for cross-branch consistency; and (3) a contrastive loss for spatial-temporal part alignment. We use a two-stage training strategy: first, training the RGB branch on ObjaverseDy, then fine-tuning the full model with BiDiFuse on KinematicParts20K with supervision on both branches.
  • Figure 4: Multi-view kinematic part video results on synthetic and real-world videos. We show qualitative results of our SP4D model on both the validation set of KinematicParts20K and real-world DAVIS videos. Each group presents two time frames across two novel views. The input video frame is noted with purple boxes. SP4D produces temporally and spatially consistent part decompositions across diverse object categories and motions.
  • Figure 5: Visual comparison of part segmentation. We show results across three views for various articulated objects. The rows contain input RGB image (top), our SP4D-generated part segmentation (middle), and the SAM2 baseline (bottom). Compared to SAM2, SP4D produces more structured part decompositions that align with object articulation and are consistent across views.
  • ...and 1 more figure
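
The contrastive loss for spatial-temporal part alignment named in the Figure 3 caption can be sketched as a supervised InfoNCE-style objective. This is a minimal sketch under stated assumptions: the captions do not give the loss formula, so the form below (a supervised contrastive loss over feature vectors sampled from different views and frames), the function name `part_contrastive_loss`, and the `temperature` value are all illustrative choices, not the authors' definition.

```python
import numpy as np


def part_contrastive_loss(features: np.ndarray, part_ids: np.ndarray, temperature: float = 0.1) -> float:
    """Supervised contrastive loss over N feature vectors of shape (N, D).

    Rows are assumed to be part features gathered across views and frames.
    Rows sharing a part id are pulled together; all other rows act as negatives.
    """
    n = features.shape[0]
    # Cosine similarity via L2-normalized features (epsilon guards zero vectors)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    feats = features / np.maximum(norms, 1e-8)
    logits = feats @ feats.T / temperature

    not_self = ~np.eye(n, dtype=bool)
    positives = (part_ids[:, None] == part_ids[None, :]) & not_self
    has_pos = positives.any(axis=1)
    if not has_pos.any():
        return 0.0  # no anchor has a positive pair

    # Numerically stable log-softmax over every other sample (self is masked out)
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - row_max - log_denominator

    # Average log-probability of the positives for each anchor that has any
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives.sum(axis=1)[has_pos]
    return float(per_anchor.mean())
```

The loss is small when features of the same part agree across views and time, and grows when features carrying the same part id point in different directions. That is the behavior a part consistency term is meant to penalize.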