EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

Haotian Ling, Zequn Chen, Qiuying Chen, Donglin Di, Yongjia Ma, Hao Li, Chen Wei, Zhulin Tao, Xun Yang

TL;DR

EverybodyDance tackles identity confusion in multi-character animation by explicitly modeling Identity Correspondence (IC) with the Identity Matching Graph (IMG), whose edge affinities come from Mask-Query Attention (MQA). It improves IC robustness through Identity-Embedded Guidance (IEG), Multi-Scale Matching (MSM), and Pre-Classified Sampling (PCS), and introduces the Identity Correspondence Evaluation (ICE) benchmark for evaluating IC in complex scenes. Experiments show higher IC accuracy and video fidelity than state-of-the-art baselines, including in cross-scenario generalization and motion transfer. The work proposes a graph-based paradigm for disentangling multiple characters and guiding accurate identity mapping in diffusion-based video generation.

Abstract

Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when characters swap positions. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation (ICE) benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
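
The bipartite-graph view described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's MQA or training objective, neither of which is specified in this section. As an assumption, edge weights are taken to be the average attention mass that queries inside a generated character's mask place on keys inside a reference character's mask, and IC correctness is scored as the share of total edge weight that falls on the ground-truth edges. The function names `masked_affinity` and `matching_score` are hypothetical.

```python
import numpy as np

def masked_affinity(attn, gen_masks, ref_masks):
    """Build the (n, m) edge-weight matrix of a complete bipartite graph.

    attn:      (Q, K) attention weights, generated-frame queries over reference-frame keys
    gen_masks: (n, Q) binary masks, one row per generated character
    ref_masks: (m, K) binary masks, one row per reference character
    W[i, j] is the mean attention mass that queries of generated character i
    place on keys of reference character j (an assumed stand-in for MQA).
    """
    mass = gen_masks @ attn @ ref_masks.T            # (n, m) summed attention mass
    counts = gen_masks.sum(axis=1, keepdims=True)    # queries per generated character
    return mass / np.maximum(counts, 1)

def matching_score(W, gt_pairs):
    """Share of total edge weight carried by ground-truth edges (assumed metric)."""
    on_match = sum(W[i, j] for i, j in gt_pairs)
    return on_match / W.sum()

# Toy case: two characters that swap positions between reference and generated frame.
attn = np.array([
    [0.1, 0.1, 0.4, 0.4],   # queries 0-1 belong to generated character 0
    [0.1, 0.1, 0.4, 0.4],
    [0.4, 0.4, 0.1, 0.1],   # queries 2-3 belong to generated character 1
    [0.4, 0.4, 0.1, 0.1],
])
gen_masks = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
ref_masks = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)

W = masked_affinity(attn, gen_masks, ref_masks)
swapped = matching_score(W, [(0, 1), (1, 0)])    # ground truth: identities swapped
unswapped = matching_score(W, [(0, 0), (1, 1)])  # wrong assumption: no swap
print(W, swapped, unswapped)
```

In this toy case the attention follows the swapped identities, so the score is high for the swapped ground truth and low for the unswapped one. A score of this kind is differentiable with respect to the attention weights, which is one way a graph-level quantity could be optimized during training.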

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The left panel highlights the challenges of extending existing methods to multi-character scenarios. The yellow box indicates feature interference between characters, while the red box marks identity mismatches. The right panel illustrates our method's accurate identity correspondence.
  • Figure 2: The left panel illustrates the IMG. The target (tgt) frame provides the ground-truth correspondence, which determines the edges that belong to the set $\mathcal{M}^*$. The right panel shows how we build the IMG. Since the regions in $\mathcal{R}$ do not overlap spatially, we sum all $\{r_i\}_{i=1}^{m}$ representations into $r_{\text{all}}$.
  • Figure 3: Training pipeline of EverybodyDance. We only construct the IMG during training. We additionally input the IEG of the reference image. ReferenceNet binds character identity by fusing the reference image's appearance to create identity-aware features that guide the DenoisingNet.
  • Figure 4: We compare our method with several state-of-the-art baselines. The last three rows illustrate three particularly challenging scenarios: (1) reference images exhibiting complex, non-standard poses; (2) target poses involving fewer characters than the corresponding reference images; and (3) reference characters undergoing severe occlusion. Under these difficult conditions, our method consistently outperforms existing approaches, demonstrating accurate IC.
  • Figure 5: Qualitative comparison against different variants. Red boxes highlight cases of identity switch, while yellow boxes indicate instances of feature contamination.
  • ...and 2 more figures
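
The Figure 2 caption notes that the per-character region representations can be summed into a single $r_{\text{all}}$ because the regions do not overlap. The sketch below shows why this loses no information under that condition: each $r_i$ can be recovered from the sum by re-applying its mask. The tensor shapes and the helper name `sum_disjoint_regions` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sum_disjoint_regions(features, masks):
    """Mask a feature map per character and sum the results.

    features: (C, H, W) feature map
    masks:    (m, H, W) binary masks that must not overlap spatially
    Returns r_all with shape (C, H, W) and the per-region stack (m, C, H, W).
    """
    if (masks.sum(axis=0) > 1).any():
        raise ValueError("masks overlap; the sum would mix characters")
    regions = features[None] * masks[:, None]   # broadcast to (m, C, H, W)
    return regions.sum(axis=0), regions

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4, 4))

# Two non-overlapping character regions: left half and right half of the map.
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1
masks[1, :, 2:] = 1

r_all, regions = sum_disjoint_regions(features, masks)

# Re-applying a mask to the sum recovers that character's representation.
recovered = [r_all * masks[i][None] for i in range(len(masks))]
print(all(np.array_equal(recovered[i], regions[i]) for i in range(len(masks))))
```

If the masks overlapped, the shared pixels would hold a mixture of two characters in the sum, and the individual representations could no longer be separated this way.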