MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

Haopeng Fang, Di Qiu, Binjie Mao, He Tang

TL;DR

MotionCharacter tackles identity preservation and fine-grained motion control in text-to-video generation. It introduces an ID-Preserving Adapter and a Motion Control Module, augmented by Region-Aware and ID-Consistency losses, and leverages the Human-Motion dataset with optical-flow-derived motion intensity to guide training. The approach enables identity-consistent videos that accurately follow nuanced actions and allows intuitive motion scaling without per-identity retraining. Experimental results and user studies show improved identity fidelity, motion adherence, and visual quality over baseline methods.

Abstract

Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in ID-preserving, high-quality video generation.

Paper Structure

This paper contains 19 sections, 11 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Given a single reference facial image, MotionCharacter can generate identity-consistent video outputs across text prompts, action phrases, and motion intensities. The upper section demonstrates its capability to accurately follow specific action phrases, while the lower section highlights its fine-grained motion control achieved by varying user-defined motion intensities.
  • Figure 2: Framework overview. Our proposed framework comprises three core components: the ID-Preserving Module, the Motion Control Module, and a composite loss function. The loss function incorporates a Region-Aware Loss to ensure high motion fidelity and an ID-Consistency Loss to maintain alignment with the reference ID image. During training, motion intensity $\mathcal{M}$ is derived from optical flow. At inference, human animations are generated based on user-defined motion intensity $\mathcal{M}$ and specified action phrases, enabling fine-grained and controllable video synthesis.
  • Figure 3: Qualitative Comparison. Comparison of our method with other approaches across diverse prompts and unseen reference images, encompassing various identities (male, female, celebrity, non-celebrity). Each column represents a unique identity and action phrase, with motion intensity fixed at 20 for clarity. "null" indicates a blank action phrase. Key prompt elements are underlined to emphasize specific actions or descriptors. For other methods, the action phrase and motion intensity are incorporated into the prompt to guide generation. To simplify notation, we abbreviate method names on the far left by omitting the common "FaceID" field, resulting in labels like IPA-Portrait.
  • Figure 4: User study results comparing our method with baselines across three evaluation criteria: identity consistency, motion controllability, and overall video quality.
  • Figure 5: Ablation study on the effects of Region-Aware Loss $\mathcal{L}_{R}$ and ID-Consistency Loss $\mathcal{L}_{id}$.
  • ...and 2 more figures
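
The Figure 2 caption states that, during training, the motion intensity $\mathcal{M}$ is derived from optical flow, but this summary does not give the formula. The sketch below shows one plausible reduction, offered as an assumption and not as the authors' definition: take the mean displacement magnitude of each flow field, then average over the clip. The names `frame_motion` and `motion_intensity` are hypothetical, the flow fields are synthetic, and any rescaling needed to reach values such as the intensity of 20 used in Figure 3 is not modeled.

```python
import math
from typing import Sequence, Tuple

# One flow field: an H x W grid of (dx, dy) pixel displacements between
# two consecutive frames. In practice this would come from an optical-flow
# estimator; here it is built by hand.
Flow = Sequence[Sequence[Tuple[float, float]]]


def frame_motion(flow: Flow) -> float:
    """Mean displacement magnitude (in pixels) over one flow field."""
    mags = [math.hypot(dx, dy) for row in flow for dx, dy in row]
    if not mags:
        raise ValueError("empty flow field")
    return sum(mags) / len(mags)


def motion_intensity(flows: Sequence[Flow]) -> float:
    """Assumed clip-level intensity: average of the per-frame mean magnitudes."""
    if not flows:
        raise ValueError("need at least one flow field")
    return sum(frame_motion(f) for f in flows) / len(flows)


# Synthetic clip: one static frame pair and one where every pixel moves by (3, 4).
static = [[(0.0, 0.0)] * 4 for _ in range(3)]
moving = [[(3.0, 4.0)] * 4 for _ in range(3)]  # magnitude 5 at every pixel

print(motion_intensity([static, moving]))  # 2.5
```

A single scalar per clip is what makes the inference-time control described in the abstract possible: the same quantity that labels training videos can be supplied directly by the user as one coefficient.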