DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Haoyu Zhao; Zhongang Qi; Cong Wang; Qingping Zheng; Guansong Lu; Fei Chen; Hang Xu; Zuxuan Wu

TL;DR

DynamiCtrl addresses pose-guided human image animation on diffusion-transformer (DiT) backbones. A shared VAE encoder maps the reference image and the driving poses into a common latent space, and Pose-adaptive Layer Norm (PadaLN) injects the pose cues into the DiT blocks. A "Joint-text" paradigm keeps text embeddings as semantic guidance: full-attention blocks align visual and textual features, which allows multi-level, mask-guided control without architectural changes. On TikTok and Unseen100, the method reports state-of-the-art perceptual quality, strong identity preservation, and high-resolution outputs, pointing to its potential for digital-human applications. The results suggest that Joint-text conditioning supports semantically rich, controllable video synthesis in DiT architectures without training a separate pose encoder.

Abstract

With the diffusion transformer (DiT) excelling in video generation, its use in specific tasks has drawn increasing attention. However, adapting DiT for pose-guided human image animation faces two core challenges: (a) existing U-Net-based pose control methods may be suboptimal for the DiT backbone; and (b) removing text guidance, as in previous approaches, often leads to semantic loss and model degradation. To address these issues, we propose DynamiCtrl, a novel framework for human animation in the video DiT architecture. Specifically, we use a shared VAE encoder for human images and driving poses, which unifies them into a common latent space, maintains pose fidelity, and eliminates the need for an expert pose encoder during video denoising. To integrate pose control into the DiT backbone effectively, we propose a novel Pose-adaptive Layer Norm model. It injects normalized pose features into the denoising process via conditioning on visual tokens, enabling seamless and scalable pose control across DiT blocks. Furthermore, to overcome the shortcomings of text removal, we introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context. Through full-attention blocks, image and pose features are aligned with text features, enhancing semantic consistency, leveraging pretrained knowledge, and enabling multi-level control. Experiments verify the superiority of DynamiCtrl on benchmark and self-collected data (e.g., achieving the best LPIPS of 0.166), demonstrating strong character control and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.
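
The idea of conditioning visual tokens on normalized pose features can be illustrated with a generic adaptive layer norm. The sketch below is not the paper's implementation: the function names, the per-token scale-and-shift form, and the zero-initialized projections are all assumptions borrowed from common adaLN-style conditioning, written in plain Python for readability.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    """Dense projection: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def zero_init(dim):
    """Hypothetical zero initialization, so the block starts as a plain layer norm."""
    return [[0.0] * dim for _ in range(dim)], [0.0] * dim


def pose_adaptive_layer_norm(visual_token, pose_token,
                             w_scale, b_scale, w_shift, b_shift):
    """Assumed form: modulate the normalized visual token with pose-derived scale and shift."""
    pose = layer_norm(pose_token)            # normalized pose features
    scale = linear(pose, w_scale, b_scale)   # pose-dependent scale
    shift = linear(pose, w_shift, b_shift)   # pose-dependent shift
    normed = layer_norm(visual_token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]
```

With zero-initialized projections the function reduces to an ordinary layer norm on the visual token, so under this assumed form the pose signal only starts to influence the output as the projections are trained.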
