MultiAnimate: Pose-Guided Image Animation Made Extensible

Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu

TL;DR

This paper proposes an extensible multi-character image animation framework, built upon modern Diffusion Transformers (DiTs) for video generation, that achieves state-of-the-art performance and surpasses existing diffusion-based baselines.

Abstract

Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, the Identifier Assigner and the Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

Paper Structure (14 sections, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Multi-character pose-guided image animation generated by our framework. Our method performs multi-character image animation with consistent identity and appearance for each character. Notably, our framework, trained only on two-character data, is capable of producing identity-consistent three-person videos and can, in principle, be extended to scenarios with even more participants (e.g., seven characters).
  • Figure 2: Dilemmas of current methods in multi-character image animation.
  • Figure 3: In multi-character image animation, identical pose sequences can lead to multiple plausible motion trajectories.
  • Figure 4: Overview of our framework. Our pipeline contains two main streams: the reference stream, which encodes the reference image and its pose to capture appearance information, and the motion stream, which encodes multi-character pose sequences and tracking masks to model motion and spatial conditions. The two streams are fused through element-wise addition of latent tokens. The Identifier Assigner unifies per-person tracking masks into a structured label representation, preserving spatial relationships and interactions among multiple characters. This representation is converted to the feature space of the DiT backbone by the Identifier Adapter.
  • Figure 5: Our framework performs well at early training stages, but inconsistencies emerge when the person-assigned labels at inference differ from those seen during training.
  • ...and 5 more figures
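
The Figure 4 caption describes two mechanical steps that a toy example can make concrete: merging per-person tracking masks into a single label representation, and fusing the reference and motion streams by element-wise addition of latent tokens. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The function names `assign_identifiers` and `fuse_streams`, the array shapes, the 1-based integer labels, and the rule that a later mask overwrites an earlier one where they overlap are all hypothetical. The learned Identifier Adapter, which maps the label representation into the DiT feature space, is not modeled here.

```python
import numpy as np


def assign_identifiers(masks: np.ndarray) -> np.ndarray:
    """Merge per-person binary masks of shape (P, H, W) into one label map (H, W).

    Label 0 is background and label k marks person k (1-based).
    Assumption: where masks overlap, the later person index wins.
    """
    label_map = np.zeros(masks.shape[1:], dtype=np.int64)
    for person_idx, mask in enumerate(masks, start=1):
        label_map[mask.astype(bool)] = person_idx
    return label_map


def fuse_streams(reference_tokens: np.ndarray, motion_tokens: np.ndarray) -> np.ndarray:
    """Fuse two token streams of identical shape by element-wise addition."""
    if reference_tokens.shape != motion_tokens.shape:
        raise ValueError("token streams must have the same shape")
    return reference_tokens + motion_tokens


if __name__ == "__main__":
    # Two people on a 2x3 grid; their masks overlap in the middle column.
    masks = np.zeros((2, 2, 3), dtype=np.uint8)
    masks[0, :, 0:2] = 1  # person 1 covers columns 0 and 1
    masks[1, :, 1:3] = 1  # person 2 covers columns 1 and 2
    print(assign_identifiers(masks))

    # Hypothetical latent tokens: 4 tokens with 8 channels each.
    rng = np.random.default_rng(0)
    reference_tokens = rng.standard_normal((4, 8))
    motion_tokens = rng.standard_normal((4, 8))
    print(fuse_streams(reference_tokens, motion_tokens).shape)
```

Because the label map is built from however many masks are supplied, the same routine accepts one, two, or more characters without any change. This is only a loose illustration of why a mask-driven representation can extend beyond the number of characters seen during training.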