MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation

Xirui Hu, Yanbo Ding, Jiahao Wang, Tingting Shi, Yali Wang, Guo Zhi Zhi, Weizhan Zhang

TL;DR

MotionWeaver tackles the challenge of multi-humanoid image animation by introducing a holistic 4D-anchored framework that decouples motion from morphology and fuses it with video latents in a shared 4D space. The Unified-Choreography Core extracts identity-agnostic motion tokens and binds them to character appearances, while the Hyper-Scene Integrator leverages depth-aware attention and Dynamic C-RoPE to model spatiotemporal relationships across multiple characters. Hierarchical-4D Supervision provides 4D-aware guidance at different diffusion timesteps, enhancing occlusion handling and motion fidelity. The approach is validated on the MultiHuman46 dataset and the DualDynamics benchmark, achieving state-of-the-art results and strong generalization to diverse humanoid forms and complex interactions, with broad implications for animation, virtual production, and synthetic data generation.

Abstract

Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.

Paper Structure

This paper contains 36 sections, 8 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: We propose MotionWeaver, a novel framework for multi-humanoid image animation, which effectively handles occlusions and complex interactions in multi-character scenarios, while showing strong generalization across diverse humanoid characters and artistic styles.
  • Figure 2: Overview of MotionWeaver. (a) The Unified-Choreography Core extracts unified motion representations $(z_{\textit{uni}})$. (b) The Hyper-Scene Integrator integrates the motion representations with video latents within a shared 4D space. (c) Hierarchical-4D Supervision uses timestep-specific 4D supervision to help the model learn motion representations effectively.
  • Figure 3: Qualitative Comparison with Existing Methods. The yellow and red meshes indicate the target motions of the left and right characters from the reference image, respectively. Our MotionWeaver method achieves superior identity preservation and motion accuracy for multiple humanoid characters. Notably, it is the only approach that correctly handles dense inter-character interactions and occlusions.
  • Figure 4: Visual Comparison of Ablation Results. (a) The yellow and red meshes represent the target motions of the red and blue characters, respectively. The original design achieves the best visual performance among all variants. (b) Visualization of attention maps between frame latents and per-character unified motion representations.
  • Figure 5: MotionWeaver can also process scenes with more than two humanoids.
  • ...and 18 more figures
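
Figure 4(b) refers to attention maps between frame latents and per-character unified motion representations. As a rough illustration of what such a map contains, the sketch below computes plain scaled dot-product attention between a few toy latent vectors (queries) and motion tokens (keys), then sums the attention weight each latent assigns to each character. This is only a generic sketch: the function names, the `owner` list mapping tokens to characters, and the toy vectors are all hypothetical, and the paper's depth-aware attention and Dynamic C-RoPE are not modeled here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(latents, motion_tokens):
    """Scaled dot-product attention weights.

    latents:       list of query vectors (one per frame-latent token)
    motion_tokens: list of key vectors (one per motion token)
    Returns one row of weights per latent; each row sums to 1.
    """
    scale = 1.0 / math.sqrt(len(latents[0]))
    rows = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in motion_tokens]
        rows.append(softmax(scores))
    return rows


def per_character_mass(attn, owner):
    """Sum attention weights by character.

    owner[j] is the (hypothetical) index of the character that
    motion token j belongs to.
    """
    n_chars = max(owner) + 1
    result = []
    for row in attn:
        mass = [0.0] * n_chars
        for weight, char in zip(row, owner):
            mass[char] += weight
        result.append(mass)
    return result


# Toy data (assumed for illustration): two latent tokens, three motion
# tokens, the first owned by character 0 and the other two by character 1.
latents = [[1.0, 0.0], [0.0, 1.0]]
motion_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 3.0]]
owner = [0, 1, 1]

attn = attention_map(latents, motion_tokens)
mass = per_character_mass(attn, owner)
```

In this toy setup the first latent places most of its attention on character 0's token and the second on character 1's tokens, which is the kind of per-character separation an attention-map visualization is meant to reveal.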