MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
Xirui Hu, Yanbo Ding, Jiahao Wang, Tingting Shi, Yali Wang, Guo Zhi Zhi, Weizhan Zhang
TL;DR
MotionWeaver addresses multi-humanoid image animation with a holistic 4D-anchored framework that decouples motion from morphology and fuses it with video latents in a shared 4D space. The Unified-Choreography Core extracts identity-agnostic motion tokens and binds them to character appearances, while the Hyper-Scene Integrator uses depth-aware attention and Dynamic C-RoPE to model spatiotemporal relationships across multiple characters. Hierarchical-4D Supervision provides 4D-aware guidance at different diffusion timesteps, improving occlusion handling and motion fidelity. The approach is trained on the MultiHuman46 dataset and evaluated on the DualDynamics benchmark, where it achieves state-of-the-art results and generalizes to diverse humanoid forms and complex interactions. Potential applications include animation, virtual production, and synthetic data generation.
Abstract
Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.
