InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin

TL;DR

The paper tackles multi-concept human animation with per-identity appearance and audio conditioning by introducing InterActHuman, a diffusion-based framework that uses a mask predictor to explicitly align layout with reference appearances and iteratively injects local audio within region-specific masks. It couples appearance injection via self-attention in a DiT backbone with a per-layer mask predictor, enabling precise spatiotemporal conditioning across identities. To support learning, the authors build a large identity-aware dataset (~2.6M triplets) with per-frame masks and captions to supervise layout and appearance. Empirical results show state-of-the-art lip synchronization, subject fidelity, and motion diversity in both single- and multi-person scenarios, particularly excelling in multi-concept customization where multiple references interact coherently.

Abstract

End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method automatically infers layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region in an iterative manner to ensure layout-aligned modality matching. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
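
The region-specific audio binding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `inject_local_audio`), the tensor shapes, the projection-free single-head attention, and the additive residual update are all assumptions chosen for brevity. The sketch shows only the core idea that each identity's audio features are added to video tokens in proportion to that identity's mask.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_tokens, audio_tokens):
    """Single-head cross-attention without learned projections (illustrative only)."""
    d = video_tokens.shape[-1]
    scores = video_tokens @ audio_tokens.T / np.sqrt(d)  # (L, A)
    return softmax(scores, axis=-1) @ audio_tokens       # (L, d)

def inject_local_audio(video_tokens, audio_per_identity, masks):
    """Add each identity's audio features only inside that identity's mask.

    video_tokens:       (L, d) float array of flattened spatiotemporal tokens
    audio_per_identity: list of (A_i, d) audio feature arrays, one per identity
    masks:              (N, L) soft masks in [0, 1], one row per identity
    """
    out = video_tokens.copy()
    for audio, mask in zip(audio_per_identity, masks):
        # Tokens outside the mask (mask == 0) receive no audio from this identity.
        out += mask[:, None] * cross_attention(video_tokens, audio)
    return out
```

With binary masks, a token that belongs to no identity is left unchanged, while a token inside exactly one mask receives only that identity's audio signal. This is the contrast with global injection, where every token would attend to every audio stream.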

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Video frames generated from audio and multi-concept reference images (human heads/full bodies, objects, scenes) display rich, audio-matched expressions. Our method enables compositional generation, including outfit changes, human–object interactions, anime styles, and dialogues, even without a start frame. Red and green wave icons denote speaking and listening, respectively.
  • Figure 2: Illustration of our framework, which adaptively predicts masks as spatial guidance for audio condition injection. During training, the mask predictor (cross-attn w/ MLP) is trained with a mask loss; during inference, mask predictions are collected in a cache, and the masks predicted at the previous denoising step ($t-1$) guide the audio cross-attn in the current denoising step ($t$).
  • Figure 3: Qualitative comparison with previous methods on multi-concept audio injection.
  • Figure 4: Qualitative comparison with previous methods on subject consistency and text following.
  • Figure 5: Qualitative ablation on audio injection strategies.
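
The inference-time mask caching described in the Figure 2 caption can be sketched as a simple loop. This is a hedged illustration, not the paper's algorithm: `denoise_with_mask_cache` and the `denoise_step` callable are hypothetical names, the per-layer predictor is collapsed into one mask per step, and the all-ones fallback for the first step (before any prediction exists) is an assumption the source does not specify.

```python
import numpy as np

def denoise_with_mask_cache(latent, num_steps, denoise_step, num_identities):
    """Run a sampling loop in which each step uses the masks cached from the previous step.

    denoise_step(latent, masks) -> (new_latent, predicted_masks) is a hypothetical
    callable standing in for one backbone forward pass plus the sampler update.
    """
    num_tokens = latent.shape[0]
    # Assumption: with an empty cache, fall back to all-ones masks (global injection).
    masks = np.ones((num_identities, num_tokens))
    used_masks = []
    for _ in range(num_steps):
        used_masks.append(masks)
        latent, predicted = denoise_step(latent, masks)
        masks = predicted  # cache this step's prediction for the next step
    return latent, used_masks
```

The design point this sketch isolates is the one-step lag: the masks that guide audio cross-attention at a given step come from the preceding step, so layout estimation and audio injection refine each other iteratively as the video is denoised.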