HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu

TL;DR

HuMo is a unified human-centric video generation framework conditioned on text, reference images, and audio. It addresses the scarcity of paired training data with a multimodal data processing pipeline, and it enables collaborative control through a two-stage progressive training paradigm and a time-adaptive Classifier-Free Guidance (CFG) strategy at inference. Key components include a minimal-invasive image injection strategy for subject preservation, a focus-by-predicting strategy for audio-visual synchronization, and a curriculum that progressively integrates audio conditioning. Experiments show that HuMo surpasses specialized state-of-the-art methods in subject preservation and audio-visual sync, with its effectiveness validated on 1.7B- and 17B-parameter backbones, and point to its potential for flexible, multimodal-driven short video production.

Abstract

Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
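To make the idea of guidance weights that change across denoising steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes a generic multi-condition CFG combination (the unconditional prediction plus a weighted sum of per-condition offsets) and a hypothetical linear schedule that shifts emphasis from text toward image and audio as denoising progresses. The function names, the weight values, and the linear interpolation are all illustrative assumptions; the abstract does not specify the actual schedule.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical guidance weights at the first and last denoising step.
# These values are illustrative assumptions, not taken from the paper.
EARLY_WEIGHTS = {"text": 7.5, "image": 2.0, "audio": 2.0}
LATE_WEIGHTS = {"text": 3.0, "image": 5.0, "audio": 5.0}


def time_adaptive_weights(step: int, num_steps: int) -> Dict[str, float]:
    """Linearly interpolate per-modality weights over the denoising steps."""
    if num_steps < 1 or not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    # progress is 0.0 at the first step and 1.0 at the last step.
    progress = step / max(num_steps - 1, 1)
    return {
        name: (1.0 - progress) * EARLY_WEIGHTS[name] + progress * LATE_WEIGHTS[name]
        for name in EARLY_WEIGHTS
    }


def guided_prediction(
    uncond: Vector,
    cond_preds: Dict[str, Vector],
    weights: Dict[str, float],
) -> Vector:
    """Generic multi-condition CFG: uncond + sum_k w_k * (cond_k - uncond)."""
    out = list(uncond)
    for name, pred in cond_preds.items():
        w = weights[name]
        for i, (c, u) in enumerate(zip(pred, uncond)):
            out[i] += w * (c - u)
    return out


if __name__ == "__main__":
    # Toy two-dimensional "predictions" standing in for model outputs.
    uncond = [0.0, 0.0]
    conds = {"text": [1.0, 0.0], "image": [0.0, 1.0], "audio": [0.0, 0.0]}
    for step in (0, 25, 49):
        w = time_adaptive_weights(step, 50)
        print(step, w, guided_prediction(uncond, conds, w))
```

In a real sampler, the per-condition predictions would come from separate forward passes of the denoiser with different conditions dropped, and the schedule would be tuned empirically rather than fixed as above.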

Paper Structure

This paper contains 14 sections, 4 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We propose HuMo, a multimodal HCVG framework that supports flexible input compositions of text-image (first row), text-audio (second row), and text-image-audio (third row). HuMo generalizes to humans, humans with objects or animals, stylized humanoid artworks, and animations.
  • Figure 2: Comparison with prior HCVG methods. Phantom [phantom_2025], HunyuanCustom [hunyuancustom_2025], and HuMo take reference images as inputs. OmniHuman-1 [omnihuman_2025] takes a start frame as input, which is synthesized by an image generation model [seedream30_2025, banana2025]. OmniHuman-1 suffers from weak text adherence and cannot generate subjects (e.g., a toy) that are absent from the start frame, while its subject preservation depends precariously on the preceding image generator. Phantom lacks audio-driven articulation to synchronize mouth movements with the spoken words in the input audio. HunyuanCustom delivers unbalanced performance on both fronts. HuMo excels in collaborative performance across video quality, subject consistency, audio-visual sync, and text controllability.
  • Figure 3: Overview of our framework. HuMo model (left) is trained based on the proposed data processing pipeline (right). Built upon a DiT-based T2V backbone from Stage 0, the model progressively learns subject preservation and audio-visual sync capabilities in Stages 1 and 2. HuMo achieves collaborative generation across different modality compositions.
  • Figure 4: The proposed time-adaptive CFG balances text guidance and identity preservation.
  • Figure 5: Qualitative comparison for the subject preservation task. Zoom in for details.
  • ...and 4 more figures