DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

Junhao Chen, Mingjin Chen, Jianjin Xu, Xiang Li, Junting Dong, Mingze Sun, Puhua Jiang, Hongxiang Li, Yuhang Yang, Hao Zhao, Xiaoxiao Long, Ruqi Huang

TL;DR

DanceTogether tackles the challenge of identity-preserving multi-person video generation under noisy control signals by proposing an end-to-end diffusion framework that binds identity and action through MaskPoseAdapter and MultiFaceEncoder. The method fuses per-person masks with pose cues, injects compact identity tokens, and leverages a diffusion backbone to produce coherent, long-range interactions from a single reference image. It introduces PairFS-4K and HumanRob-300 datasets and TogetherVideoBench to rigorously evaluate identity-consistency, interaction-coherence, and video quality, reporting significant gains over prior work and good generalization to human–robot interactions. This work enables compositionally controllable, multi-actor video synthesis with broad implications for digital production, simulation, and embodied AI, while acknowledging limitations in group size, input-signal reliability, and deployment safeguards.

Abstract

Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at https://DanceTog.github.io/.
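The idea of fusing tracking masks with noisy pose heat-maps can be made concrete with a toy example. The sketch below is not the paper's MaskPoseAdapter, which is a learned module whose internals are not described here. It assumes a much simpler stand-in: each person's pose heat-map is multiplied elementwise by that person's mask, so pose responses that land outside the mask are suppressed and each identity keeps its own control channel. The function names `gate_pose_with_mask` and `fuse_controls` and all toy values are hypothetical.

```python
from typing import List

Grid = List[List[float]]  # one H x W map stored as nested lists


def gate_pose_with_mask(pose: Grid, mask: Grid) -> Grid:
    """Zero out pose heat-map responses that fall outside one person's mask.

    Assumption: a plain elementwise product stands in for the learned fusion.
    """
    if len(pose) != len(mask) or any(len(p) != len(m) for p, m in zip(pose, mask)):
        raise ValueError("pose and mask must have the same shape")
    return [[p * m for p, m in zip(prow, mrow)] for prow, mrow in zip(pose, mask)]


def fuse_controls(poses: List[Grid], masks: List[Grid]) -> List[Grid]:
    """Return one gated control map per person, one channel per identity."""
    if len(poses) != len(masks):
        raise ValueError("need exactly one mask per pose stream")
    return [gate_pose_with_mask(p, m) for p, m in zip(poses, masks)]


# Toy 2 x 3 frame with two people: person 0 on the left, person 1 on the right.
masks = [
    [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
]
# Person 0's pose stream carries a spurious response at the top-right pixel,
# which lies inside person 1's region.
poses = [
    [[0.9, 0.2, 0.7], [0.1, 0.0, 0.0]],
    [[0.0, 0.0, 0.8], [0.0, 0.3, 0.5]],
]
fused = fuse_controls(poses, masks)
```

In this toy frame the spurious response in person 0's stream is removed because it lies inside person 1's mask, while person 1's stream passes through unchanged. A learned adapter would presumably do something richer than a hard gate, but the example shows why a reliable mask helps when the pose signal is noisy.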

Paper Structure

This paper contains 38 sections, 18 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: DanceTogether generates complex two-person interaction videos with interactive details and consistent identity preservation from a single reference image (see the left-most of each row), using independent multi-person pose and mask sequences as control signals.
  • Figure 2: DanceTogether pipeline overview: A single reference image and per-person pose/mask sequences enter the system; the MaskPoseAdapter fuses these control signals, the MultiFace Encoder injects identity tokens, and the video-diffusion backbone synthesizes an interaction video that preserves consistent identities for all actors.
  • Figure 3: Data Curation Pipeline Overview. Our pipeline processes raw videos through human tracking, mask generation with SAMURAI, pose estimation with DW-Pose, and alpha matting to produce per-person annotations.
  • Figure 4: The RGB image in the “Ref Image” row is the input reference frame, and the two pose maps in that row correspond to the inference results shown immediately below. All baselines exhibit severe identity drift, loss of interaction details, or even missing subjects when dealing with position exchanges and complex interactive poses. For additional qualitative results, please refer to the Appendix figures (labels fig:animation_result and fig:animation_result2).
  • Figure 5: Ablation study animation results (1/2).
  • ...and 16 more figures