Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation
Zhoujie Fu, Jiacheng Wei, Wenhao Shen, Chaoyue Song, Xiaofeng Yang, Fayao Liu, Xulei Yang, Guosheng Lin
TL;DR
Sync4D tackles controllable 4D generation by transferring motion from casually captured reference videos to generated 3D Gaussians. It combines blend skinning-based shape reconstruction, cross-modal shape correspondence, and a physics-driven MLS-MPM simulation whose delta-velocity field is optimized with a displacement loss to preserve temporal coherence and shape integrity. The method supports diverse references (humans, quadrupeds, articulated objects) and dynamics of arbitrary length, and in user studies it outperforms baselines built on video diffusion models in motion similarity and shape consistency. The resulting 4D content is high-fidelity and physically plausible, making it suitable for VR, gaming, and simulation; remaining limitations concern topology transfer and initial pose alignment.
Abstract
In this work, we introduce a novel approach for creating controllable dynamics in generated 3D Gaussians using casually captured reference videos. Our method transfers the motion of objects in reference videos to a variety of generated 3D Gaussians across different categories, ensuring precise and customizable motion transfer. We achieve this by employing blend skinning-based non-parametric shape reconstruction to extract the shape and motion of reference objects. This process involves segmenting the reference objects into motion-related parts based on skinning weights and establishing shape correspondences with the generated target shapes. To address the shape and temporal inconsistencies prevalent in existing methods, we integrate physical simulation, driving the target shapes with the matched motion. The simulation is optimized through a displacement loss to ensure reliable and realistic dynamics. Our approach supports diverse reference inputs, including humans, quadrupeds, and articulated objects, and can generate dynamics of arbitrary length, providing enhanced fidelity and applicability. Unlike methods that rely heavily on video diffusion models, our technique offers specific, high-quality motion transfer that maintains both shape integrity and temporal consistency.
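The pipeline described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes plain linear blend skinning with per-bone rigid transforms, a hard part assignment by the largest skinning weight, and a simple mean-squared comparison of per-part displacements standing in for the displacement loss. All function names are hypothetical, and the MLS-MPM simulation and shape correspondence steps are omitted.

```python
import numpy as np

def blend_skinning(points, weights, rotations, translations):
    """Linear blend skinning (assumed variant, not necessarily the paper's).

    points:       (N, 3) rest-pose positions
    weights:      (N, B) skinning weights, each row summing to 1
    rotations:    (B, 3, 3) per-bone rotation matrices
    translations: (B, 3) per-bone translations
    """
    # Apply every bone transform to every point: result has shape (B, N, 3).
    per_bone = np.einsum("bij,nj->bni", rotations, points) + translations[:, None, :]
    # Blend the per-bone results with the skinning weights: shape (N, 3).
    return np.einsum("nb,bni->ni", weights, per_bone)

def segment_by_weights(weights):
    """Assign each point to the bone with the largest skinning weight."""
    return np.argmax(weights, axis=1)

def part_displacement(prev, nxt, labels, num_parts):
    """Mean displacement of each part between two frames, shape (num_parts, 3)."""
    disp = nxt - prev
    out = np.zeros((num_parts, 3))
    for p in range(num_parts):
        mask = labels == p
        if mask.any():
            out[p] = disp[mask].mean(axis=0)
    return out

def displacement_loss(target_disp, reference_disp):
    """Hypothetical loss: mean squared difference of per-part displacements."""
    return float(np.mean(np.sum((target_disp - reference_disp) ** 2, axis=1)))
```

In this sketch, the reference object would be segmented with `segment_by_weights`, its per-part motion summarized with `part_displacement`, and the simulated target penalized with `displacement_loss` wherever its matched parts move differently.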
