AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Zhenyu Xie, Ji Xia, Michael Kampffmeyer, Panwen Hu, Zehua Ma, Yujian Zheng, Jing Wang, Zheng Chong, Xujie Zhang, Xianhang Cheng, Xiaodan Liang, Hao Li

Abstract

Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...
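The three attention stages listed in the abstract can be pictured as three boolean attention masks over a labeled token sequence. The sketch below is only one plausible reading, not the authors' implementation: the function name `stage_masks`, the per-token integer labels (0 for background, `k >= 1` for identity `k`), and the exact masking rule for each stage are all assumptions made for illustration.

```python
def stage_masks(labels):
    """Build three illustrative attention masks from per-token labels.

    labels[i] == 0 marks a background token; labels[i] == k (k >= 1)
    marks a token belonging to identity k. This labeling scheme is an
    assumption for the sketch, not something stated in the abstract.
    masks[i][j] is True when token i may attend to token j.
    """
    n = len(labels)
    # Stage (i): foreground tokens attend only within their own identity.
    foreground = [
        [labels[i] > 0 and labels[i] == labels[j] for j in range(n)]
        for i in range(n)
    ]
    # Stage (ii): background tokens attend only to background tokens.
    background = [
        [labels[i] == 0 and labels[j] == 0 for j in range(n)]
        for i in range(n)
    ]
    # Stage (iii): every token attends to every token.
    global_mask = [[True] * n for _ in range(n)]
    return foreground, background, global_mask


# Two tokens of identity 1, one token of identity 2, one background token.
labels = [1, 1, 2, 0]
foreground, background, global_mask = stage_masks(labels)
```

Under these assumptions, identity 1 and identity 2 never exchange information in the first stage, which is the intuition behind preventing identity bleeding; the later stages reintroduce context. How overlapping tokens are fused (the AGF module) is not covered by this sketch.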

Paper Structure (21 sections, 6 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: We propose AnyCrowd, a versatile framework for multi-character animation, which supports: (a) animation of an arbitrary number of characters sourced from either single or multiple reference images; (b) diverse configurations such as many-to-one (multiple poses driving one identity) or one-to-many (one pose driving multiple identities) generation; and (c) arbitrary assignments between IDs and pose sequences.
  • Figure 2: Failure cases of the baseline method VACE (Jiang et al., 2025) in multi-character scenarios. (a) Identity-Pose Mis-Binding (red): identity swap. (b) Identity Entanglement (green): appearance blending. Boxes track specific identities across Reference, GT, and generated frames.
  • Figure 3: Overview of AnyCrowd. (a) Instance-Isolated Latent Representation (IILR): The reference image with $C$ identities is decoupled into $C+1$ isolated images and encoded into identity-decoupled reference tokens. (b) Architecture: AnyCrowd is built upon a dual-stream DiT architecture, where the Context and DiT branches process conditioning signals and perform iterative denoising. (c) Tri-Stage Decoupled Attention (TSDA): This mechanism facilitates explicit identity-pose binding during the self-attention process, incorporating an Adaptive Gated Fusion (AGF) module to adaptively fuse overlapping tokens from different categories.
  • Figure 4: (a, b) Character number distribution of MCD-7K (train) and MCD-300 (test). (c) Preference results on MCD-300 under cross-driven setting.
  • Figure 5: Qualitative comparison on MCD-300 under the self-driven setting. Red dashed boxes highlight typical artifacts in baseline results. Please zoom in for details.
  • ...and 8 more figures
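
The Figure 3(a) caption describes decoupling a reference image with $C$ identities into $C+1$ isolated images. A minimal sketch of that split is shown below, assuming a per-pixel instance map is available (0 for background, `k` for identity `k`) and that pixels outside the selected instance are simply replaced by a fill value. The function name `isolate_instances`, the instance map, and the fill strategy are hypothetical; the captions do not say how the masks are obtained or how the background image is constructed.

```python
def isolate_instances(image, instance_map, num_ids, fill=0):
    """Split one image into num_ids + 1 isolated images.

    image and instance_map are equally sized 2D lists. Output index 0
    keeps only background pixels; output index k keeps only the pixels
    of identity k. All other pixels are replaced by `fill` (an
    assumption made for this sketch).
    """
    isolated = []
    for k in range(num_ids + 1):
        isolated.append([
            [px if inst == k else fill for px, inst in zip(row, inst_row)]
            for row, inst_row in zip(image, instance_map)
        ])
    return isolated


# A toy 2x2 "image" with two identities (C = 2) and one background pixel.
image = [[5, 6],
         [7, 8]]
instance_map = [[0, 1],
                [2, 1]]
isolated = isolate_instances(image, instance_map, num_ids=2)
```

Each of the resulting images could then be encoded separately, so that tokens from different identities are not mixed before they reach the DiT, which is the stated purpose of IILR.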