MagicPose4D: Crafting Articulated Models with Appearance and Motion Control

Hao Zhang, Di Chang, Fang Li, Mohammad Soleymani, Narendra Ahuja

TL;DR

MagicPose4D addresses the difficulty of specifying precise articulated motion in 4D content generation. It accepts monocular videos or mesh sequences as motion prompts and uses a dual-phase reconstruction that separates shape learning from the extraction of physically plausible motion. The framework couples a skeleton-based representation with a Global-local Chamfer loss that aligns predicted mesh vertices with the 3D pseudo-supervision while preserving part-level alignment, and then applies a training-free cross-category motion transfer to carry the extracted motion over to other identities. The reported experiments show more accurate 4D generation, better temporal consistency, and motion transfer between humanoid and animal targets, outperforming existing methods on several benchmarks. The approach targets more controllable 4D content for animation, gaming, and virtual avatars, and its transfer stage generalizes to new targets without additional training.

Abstract

With the success of 2D and 3D visual generative models, there is growing interest in generating 4D content. Existing methods primarily rely on text prompts to produce 4D content, but they often fall short of accurately defining complex or rare motions. To address this limitation, we propose MagicPose4D, a novel framework for refined control over both appearance and motion in 4D generation. Unlike current 4D generation methods, MagicPose4D accepts monocular videos or mesh sequences as motion prompts, enabling precise and customizable motion control. MagicPose4D comprises two key modules: (i) a Dual-Phase 4D Reconstruction Module, which operates in two phases. The first phase focuses on capturing the model's shape using accurate 2D supervision and less accurate but geometrically informative 3D pseudo-supervision, without imposing skeleton constraints. The second phase extracts the 3D motion (skeleton poses) using the more accurate pseudo-3D supervision obtained in the first phase, and introduces kinematic-chain-based skeleton constraints to ensure physical plausibility. Additionally, we propose a Global-local Chamfer loss that aligns the overall distribution of predicted mesh vertices with the supervision while maintaining part-level alignment without extra annotations. (ii) a Cross-category Motion Transfer Module, which leverages the motion extracted by the 4D reconstruction module and uses a kinematic-chain-based skeleton to achieve cross-category motion transfer. It ensures smooth transitions between frames through dynamic rigidity, facilitating robust generalization without additional training. Through extensive experiments, we demonstrate that MagicPose4D significantly improves the accuracy and consistency of 4D content generation, outperforming existing methods on various benchmarks.
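
The abstract describes the Global-local Chamfer loss only at a high level: one term aligns the whole predicted vertex set with the supervision, and another keeps individual parts aligned. The sketch below is an illustration of that idea under stated assumptions, not the paper's formulation. The function names, the `local_weight` parameter, and the per-point integer part labels (for example, derived from the dominant skinning weight) are all hypothetical; the source does not say how parts are obtained or how the terms are weighted.

```python
import math
from collections import defaultdict


def chamfer(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points.

    Each direction averages the squared distance from a point to its
    nearest neighbour in the other set.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def global_local_chamfer(pred, target, pred_parts, target_parts, local_weight=1.0):
    """Global Chamfer term plus the mean Chamfer term over shared parts.

    pred_parts / target_parts are hypothetical integer labels, one per point.
    This weighting scheme is an assumption made for illustration.
    """
    global_term = chamfer(pred, target)

    # Group points by part label.
    pred_groups, target_groups = defaultdict(list), defaultdict(list)
    for point, label in zip(pred, pred_parts):
        pred_groups[label].append(point)
    for point, label in zip(target, target_parts):
        target_groups[label].append(point)

    # Only parts present in both sets contribute to the local term.
    shared = sorted(set(pred_groups) & set(target_groups))
    if not shared:
        return global_term
    local_term = sum(chamfer(pred_groups[k], target_groups[k]) for k in shared) / len(shared)
    return global_term + local_weight * local_term
```

The global term alone can be small even when, say, a predicted arm lands on the target's leg; the per-part term penalizes that kind of mismatch, which is the motivation the abstract gives for combining the two.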

Paper Structure (26 sections, 9 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Overview of MagicPose4D. MagicPose4D takes motion prompts (monocular video or dynamic mesh sequence) and appearance prompts (text or image) to control 4D content generation. The dual-phase 4D reconstruction module extracts motion references, while the cross-category motion transfer module applies these motions to different target objects while ensuring temporal consistency.
  • Figure 2: Overview of Dual-Phase 4D Reconstruction Module. Given a monocular video, the Reconstruction Module extracts a physically plausible skeleton sequence by reconstructing a 3D mesh sequence. The first phase focuses on capturing the object's shape, using a non-kinematic skeleton with learnable skinning weights for greater deformation flexibility. The second phase refines the skeleton for motion transfer, employing kinematic chains and heat diffusion-based skinning weights to ensure structural plausibility. Supervision shifts from a mix of 2D and pseudo-3D in the first phase to solely 3D pseudo-supervision in the second phase.
  • Figure 3: Overview of Cross-Category Motion Transfer Module. In the second phase of 4D Reconstruction, the skeletal motion of the reference meshes is extracted. The skeletonization module derives the canonical skeleton (at time = 0), which remains fixed along with the skinning weights, so that pose is controlled through per-frame angles and bone scales. If a skeleton template is available, it is embedded using a graph-based approach; otherwise, skeleton extraction methods are applied. The learned motion parameters are then transferred to the target object using forward kinematics, and blend skinning is applied to generate the deformed meshes.
  • Figure 4: Appearance and Motion Controlled 4D Generation. MagicPose4D can take either dynamic mesh sequences or monocular videos as motion prompts. These reference motions can be transferred to both humanoid and animal target identities.
  • Figure 5: 4D Generation, a) and b): comparison of MagicPose4D with Animate124 [zhao2023animate124]. Motion Transfer, c): comparison of MagicPose4D with 3D-CoreNet [song20213d] and X-DualNet [10076900]. Videos are in the section labeled `video`.
  • ...and 7 more figures
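
The Figure 3 caption describes pose control through joint angles and bone scales on a fixed canonical skeleton, followed by forward kinematics and blend skinning. A minimal sketch of that pipeline is given below, simplified to a planar (2D) skeleton so it stays dependency-free. All function names, the data layout, and the choice to let bone scales affect only joint offsets are assumptions made for illustration; the paper's skeleton is 3D and its skinning weights come from heat diffusion, which is not reproduced here.

```python
import math


def rotate(p, angle):
    """Rotate a 2D point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def forward_kinematics(rest, parents, angles, scales):
    """Pose a 2D skeleton from per-joint local angles and per-bone scales.

    rest    : rest-pose joint positions, parents listed before children
    parents : parent index per joint, -1 for the root
    Returns (posed joint positions, accumulated rotation per joint).
    """
    posed, total = [], []
    for i, parent in enumerate(parents):
        if parent < 0:
            posed.append(rest[i])      # the root stays where it is
            total.append(angles[i])
            continue
        # Rest-pose bone vector, scaled, then rotated by the parent's rotation.
        bone = ((rest[i][0] - rest[parent][0]) * scales[i],
                (rest[i][1] - rest[parent][1]) * scales[i])
        bone = rotate(bone, total[parent])
        posed.append((posed[parent][0] + bone[0], posed[parent][1] + bone[1]))
        total.append(total[parent] + angles[i])
    return posed, total


def blend_skinning(vertices, weights, rest, posed, total):
    """Linear blend skinning: weighted sum of per-joint rigid transforms.

    weights[v][j] is the influence of joint j on vertex v; rows sum to 1.
    """
    deformed = []
    for v, row in zip(vertices, weights):
        x = y = 0.0
        for j, w in enumerate(row):
            local = rotate((v[0] - rest[j][0], v[1] - rest[j][1]), total[j])
            x += w * (posed[j][0] + local[0])
            y += w * (posed[j][1] + local[1])
        deformed.append((x, y))
    return deformed
```

In this sketch, transferring motion amounts to calling `forward_kinematics` with the same `angles` and `scales` sequence but a different target's `rest` skeleton and skinning weights, which is why no retraining is needed for a new target.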