Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation

Zhoujie Fu, Jiacheng Wei, Wenhao Shen, Chaoyue Song, Xiaofeng Yang, Fayao Liu, Xulei Yang, Guosheng Lin

TL;DR

Sync4D addresses controllable 4D generation by transferring motion from casually captured reference videos to generated 3D Gaussians. It combines blend skinning-based shape reconstruction, cross-modal shape correspondence, and a physics-driven MLS-MPM framework whose delta-velocity field is optimized with a displacement loss to preserve temporal coherence and shape integrity. The method supports diverse references (humans, quadrupeds, articulated objects) and dynamics of arbitrary length, and user studies rate it above diffusion-video-based baselines on motion similarity and shape consistency. The result is high-fidelity, physically plausible 4D content suitable for VR, gaming, and simulation, with remaining limitations in topology transfer and initial pose alignment.

Abstract

In this work, we introduce a novel approach for creating controllable dynamics in 3D-generated Gaussians using casually captured reference videos. Our method transfers the motion of objects from reference videos to a variety of generated 3D Gaussians across different categories, ensuring precise and customizable motion transfer. We achieve this by employing blend skinning-based non-parametric shape reconstruction to extract the shape and motion of reference objects. This process involves segmenting the reference objects into motion-related parts based on skinning weights and establishing shape correspondences with generated target shapes. To address shape and temporal inconsistencies prevalent in existing methods, we integrate physical simulation, driving the target shapes with matched motion. This integration is optimized through a displacement loss to ensure reliable and genuine dynamics. Our approach supports diverse reference inputs, including humans, quadrupeds, and articulated objects, and can generate dynamics of arbitrary length, providing enhanced fidelity and applicability. Unlike methods heavily reliant on diffusion video generation models, our technique offers specific and high-quality motion transfer, maintaining both shape integrity and temporal consistency.
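
The abstract's two reconstruction ingredients, blend skinning and segmentation into motion-related parts from skinning weights, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the hard arg-max part assignment, and the toy two-bone setup are assumptions made for illustration, and the paper's reconstruction may use a different (for example learned) skinning model.

```python
def segment_by_skinning(weights):
    """Assign each point to the bone with the largest skinning weight.

    Assumption: a hard arg-max assignment; the source only says parts are
    derived from skinning weights.
    """
    return [max(range(len(w)), key=w.__getitem__) for w in weights]


def apply_rigid(rotation, translation, point):
    """Apply one bone's rigid transform R @ p + t to a 3D point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def blend_skinning(points, weights, transforms):
    """Linear blend skinning: p' = sum_b w_b * (R_b @ p + t_b)."""
    deformed = []
    for point, point_weights in zip(points, weights):
        blended = [0.0, 0.0, 0.0]
        for w_b, (rotation, translation) in zip(point_weights, transforms):
            moved = apply_rigid(rotation, translation, point)
            blended = [acc + w_b * m for acc, m in zip(blended, moved)]
        deformed.append(blended)
    return deformed


# Hypothetical toy data: bone 0 stays fixed, bone 1 translates by +1 along x.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
transforms = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [1.0, 0.0, 0.0])]
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
weights = [[1.0, 0.0], [0.25, 0.75], [0.0, 1.0]]

labels = segment_by_skinning(weights)
deformed = blend_skinning(points, weights, transforms)
```

In this toy case the middle point follows bone 1 with weight 0.75, so it moves three quarters of the bone's translation and is labeled as part of bone 1.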

Paper Structure (19 sections, 18 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Our proposed method can create dynamics on various generated 3D Gaussians guided by the reference casual video.
  • Figure 2: Overview of Sync4D: Sync4D processes a reference video to derive a canonical shape and a bone-based motion sequence through reconstruction techniques. Meanwhile, given a text or image prompt, we generate a 3D Gaussian object through diffusion models. The framework matches motion-related parts of the reconstructed shape to the generated shape and transfers the motion. This motion information is then used to initialize the physical velocity signals. We employ a triplane representation to produce a delta velocity field that adjusts these signals. The velocity field for each part of the target is optimized using differentiable Material Point Method (MPM) simulation. To ensure fidelity to the original, a displacement loss is designed to reduce cumulative errors and ensure plausible motions.
  • Figure 3: Comparative Analysis between Sync4D and Other Frameworks. On the left, the reference video alongside the edited video from DMT is displayed. The upper example shows a successful adaptation, whereas the lower example is deemed a failure due to continual alterations in shape and appearance across frames. On the right, the Sync4D outputs are highlighted, showcasing superior motion and shape consistency relative to other frameworks.
  • Figure 4: We present the qualitative results of our generated 3D dynamics with reference video frames. Our method generates dynamics that align with the reference motion while retaining the shape integrity and temporal consistency. Please check the video results in the supplementary materials for a more intuitive illustration.
  • Figure 5: Ablation study on the number of bones used in reconstruction to segment motion-related parts. Top row: number of bones $B = 25$. Bottom row: number of bones $B = 13$, indicating the minimum articulated parts. Black indicates removed outliers.
  • ...and 2 more figures
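
The Figure 2 caption describes an initial velocity taken from the transferred motion, a delta velocity that corrects it, and a displacement loss that compares the simulated motion with the reference. The sketch below illustrates that idea only under strong assumptions: a plain explicit-Euler update stands in for the differentiable MPM simulator, one shared velocity is used for a whole part, and the loss is taken to be a mean squared error on displacement. The exact form of the paper's loss is not given in this summary, and all names and numbers here are hypothetical.

```python
def integrate(positions, velocity, dt, steps):
    """Stand-in for the simulator: move every particle with one shared velocity.

    Assumption: explicit Euler with a constant velocity, not MLS-MPM.
    """
    current = [list(p) for p in positions]
    for _ in range(steps):
        current = [[x + dt * v for x, v in zip(p, velocity)] for p in current]
    return current


def displacement_loss(start, end, target_displacement):
    """Mean squared error between simulated and target displacement."""
    total = 0.0
    for p0, p1 in zip(start, end):
        simulated = [b - a for a, b in zip(p0, p1)]
        total += sum((s - t) ** 2 for s, t in zip(simulated, target_displacement))
    return total / len(start)


# Hypothetical part with two particles and a reference displacement for one frame.
part = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
initial_velocity = [1.0, 0.0, 0.0]       # from the transferred motion
delta_velocity = [0.0, 0.5, 0.0]         # correction, e.g. from a learned field
target_displacement = [0.5, 0.25, 0.0]   # displacement of the matched reference part
dt, steps = 0.125, 4

uncorrected = integrate(part, initial_velocity, dt, steps)
corrected_velocity = [v + dv for v, dv in zip(initial_velocity, delta_velocity)]
corrected = integrate(part, corrected_velocity, dt, steps)

loss_before = displacement_loss(part, uncorrected, target_displacement)
loss_after = displacement_loss(part, corrected, target_displacement)
```

Here the uncorrected velocity misses the reference motion along y, and adding the delta velocity removes that error. In the described pipeline the correction would instead be found by optimizing through the differentiable simulation.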