DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Haoyu Zhao, Zhongang Qi, Cong Wang, Qingping Zheng, Guansong Lu, Fei Chen, Hang Xu, Zuxuan Wu

TL;DR

DynamiCtrl tackles pose-guided human image animation within diffusion-transformer (DiT) backbones by unifying image and pose latent spaces through a shared VAE encoder and introducing Pose-adaptive Layer Norm (PadaLN) to inject pose cues into the DiT. It further preserves semantic guidance via a Joint-text paradigm, which aligns visual and textual cues through full attention and supports multi-level, mask-guided control without architectural changes. Empirical results on TikTok and Unseen100 show state-of-the-art perceptual quality, strong identity preservation, and high-resolution outputs, supporting both the approach and its potential for digital-human applications. The method demonstrates that Joint-text conditioning can unlock semantically rich, controllable video synthesis in diffusion-transformer architectures while avoiding extra pose-encoder training.

Abstract

With diffusion transformers (DiT) excelling in video generation, their use in specific tasks has drawn increasing attention. However, adapting DiT for pose-guided human image animation faces two core challenges: (a) existing U-Net-based pose control methods may be suboptimal for the DiT backbone; and (b) removing text guidance, as in previous approaches, often leads to semantic loss and model degradation. To address these issues, we propose DynamiCtrl, a novel framework for human animation in the video DiT architecture. Specifically, we use a shared VAE encoder for human images and driving poses, unifying them into a common latent space, maintaining pose fidelity, and eliminating the need for an expert pose encoder during video denoising. To integrate pose control into the DiT backbone effectively, we propose a novel Pose-adaptive Layer Norm (PadaLN) model. It injects normalized pose features into the denoising process via conditioning on visual tokens, enabling seamless and scalable pose control across DiT blocks. Furthermore, to overcome the shortcomings of text removal, we introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context. Through full-attention blocks, image and pose features are aligned with text features, enhancing semantic consistency, leveraging pretrained knowledge, and enabling multi-level control. Experiments verify the superiority of DynamiCtrl on benchmark and self-collected data (e.g., achieving the best LPIPS of 0.166), demonstrating strong character control and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.
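
To make the idea of injecting normalized pose features by conditioning on visual tokens more concrete, the following is a minimal sketch of the generic adaptive layer norm pattern, in which a conditioning signal predicts a per-feature scale and shift. It is not the authors' implementation: the abstract does not give the exact PadaLN formulation, so the function names, the single-token shapes, and the projection weights (`w_scale`, `b_scale`, `w_shift`, `b_shift`) are all illustrative assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Matrix-vector product plus bias; `weight` holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def pose_adaptive_layer_norm(visual_token, pose_token,
                             w_scale, b_scale, w_shift, b_shift):
    """Hypothetical adaptive layer norm conditioned on a pose token.

    Assumption: the pose token is normalized first, then projected to a
    per-feature scale and shift that modulate the normalized visual token.
    """
    pose = layer_norm(pose_token)             # normalized pose features
    scale = linear(pose, w_scale, b_scale)    # pose-dependent scale
    shift = linear(pose, w_shift, b_shift)    # pose-dependent shift
    normed = layer_norm(visual_token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


if __name__ == "__main__":
    dim = 4
    zero_w = [[0.0] * dim for _ in range(dim)]
    zero_b = [0.0] * dim
    visual = [1.0, 2.0, 3.0, 4.0]
    pose = [0.5, -1.0, 2.0, 0.0]
    # With zero projections the pose has no effect and the result is plain layer norm.
    print(pose_adaptive_layer_norm(visual, pose, zero_w, zero_b, zero_w, zero_b))
```

In this sketch, zero-valued projections reduce the layer to an ordinary layer norm, which is a common way to start fine-tuning a pretrained backbone without disturbing it. Whether DynamiCtrl uses that initialization is not stated in the abstract.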

Paper Structure

This paper contains 19 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Pose-guided human image animation results by DynamiCtrl at 1360$\times$768 and 1360$\times$1360 resolutions. We show generated frames (top) for different persons (bottom), driven by a precise pose sequence.
  • Figure 2: Overview of the proposed DynamiCtrl framework for human image animation. In (b) and (c), the T2V, I2V, and P2V denote the text-guided, image-guided, and pose-guided video generation.
  • Figure 3: Vision-to-vision spatial and temporal attention visualizations of different control methods.
  • Figure 4: The "Joint-text" paradigm enables our model to achieve multi-level controllability, allowing not only precise human motion, but also fine-grained control over all elements in the image.
  • Figure 5: Qualitative comparisons with state-of-the-art methods on five challenging unseen examples. We use their released models for generation; Animate-X (Tan et al., 2024) is the latest open-source animation model.
  • ...and 5 more figures