High Quality Human Image Animation using Regional Supervision and Motion Blur Condition

Zhongcong Xu, Chaoyue Song, Guoxian Song, Jianfeng Zhang, Jun Hao Liew, Hongyi Xu, You Xie, Linjie Luo, Guosheng Lin, Jiashi Feng, Mike Zheng Shou

TL;DR

HIA tackles the challenge of high-quality human image animation by addressing two core gaps: fidelity in small but critical regions (face and hands) and realistic motion blur, which prior diffusion-based methods often overlook. It introduces regional supervision with targeted losses for face and hands, and explicitly models hand motion blur via hand movement vectors and sharpness cues, integrated into ControlNet guidance. Coupled with shifted SNR and a progressive training regime, HIA achieves state-of-the-art results on both the HumanDance and TikTok datasets, significantly improving reconstruction accuracy (L1) and perceptual quality (FVD) over strong baselines. The approach employs a multi-stage training pipeline and inference-time techniques (initial reference noise, animation-cfg, prompt traveling) to deliver robust, high-resolution, temporally coherent animations with strong generalization capabilities.

Abstract

Recent advances in video diffusion models have enabled realistic and controllable human image animation with temporal coherence. Although they generate reasonable results, existing methods often overlook the need for regional supervision in crucial areas such as the face and hands, and neglect the explicit modeling of motion blur, leading to unrealistic, low-quality synthesis. To address these limitations, we first leverage regional supervision for detailed regions to enhance face and hand faithfulness. Second, we model the motion blur explicitly to further improve the appearance quality. Third, we explore novel training strategies for high-resolution human animation to improve the overall fidelity. Experimental results demonstrate that our proposed method outperforms state-of-the-art approaches, improving upon the strongest baseline by more than 21.0% and 57.4% in terms of reconstruction precision (L1) and perceptual quality (FVD) on the HumanDance dataset. Code and model will be made available.

Paper Structure

This paper contains 22 sections, 4 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: We introduce HIA, a high-quality human image animation framework designed to generate realistic results, particularly for small-scale regions such as faces and hands. Our approach incorporates explicit conditioning on the motion blur of hands, enabling precise control over hand sharpness. We overlay the motion signal and motion blur condition on the top left and top right corners of each synthesized video frame respectively.
  • Figure 2: Given a random noisy latent, a reference image, a motion sequence, and a motion blur condition, our model synthesizes the avatar using the identity and background from the reference image and animates the avatar adhering to the provided motion sequence (left panel). To enhance the quality of the face and hands, we devise a regional supervision stage that fine-tunes the appearance encoder with MSE and cosine similarity loss terms (right panel).
  • Figure 3: Qualitative comparisons between ours and baselines on two datasets. The driving signal is overlaid in the upper left corner of each frame. Errors in the baseline methods are highlighted in red boxes. Please refer to our project page in Sup. Mat. for video results.
  • Figure 4: Qualitative comparisons between ours and baselines on unseen categories, i.e., humanoid and oil painting portraits. Errors in the baseline methods are highlighted in orange boxes. Please refer to our project page in Sup. Mat. for video results.
  • Figure 5: Visualization of ablation studies, with errors highlighted in red boxes. Each frame includes an overlay of the target pose in the bottom left or top right corner for reference.
  • ...and 4 more figures
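
The Figure 2 caption mentions MSE and cosine similarity loss terms applied to face and hand regions. The sketch below illustrates one way such a regional loss could be combined. It is not the authors' implementation: the function name `regional_loss`, the `cos_weight` parameter, the binary region mask, and the choice to compute both terms on a `(C, H, W)` array are all assumptions made for illustration, since this page does not state which tensors the paper's losses operate on.

```python
import numpy as np

def regional_loss(pred, target, mask, cos_weight=1.0, eps=1e-8):
    """Hypothetical regional loss: masked MSE plus masked (1 - cosine similarity).

    pred, target: float arrays of shape (C, H, W).
    mask: array of shape (H, W), 1 inside the face/hand region, 0 elsewhere.
    """
    m = mask.astype(pred.dtype)
    n = m.sum()
    if n == 0:
        # No face/hand pixels in this sample: the regional term vanishes.
        return 0.0
    # Masked MSE: mean squared error over masked pixels and all channels.
    sq_err = ((pred - target) ** 2) * m[None]
    mse = sq_err.sum() / (n * pred.shape[0])
    # Per-pixel cosine similarity along the channel axis.
    dot = (pred * target).sum(axis=0)
    norms = np.linalg.norm(pred, axis=0) * np.linalg.norm(target, axis=0)
    cos = dot / (norms + eps)
    # Penalize directional mismatch only inside the region.
    cos_term = ((1.0 - cos) * m).sum() / n
    return float(mse + cos_weight * cos_term)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(4, 8, 8))
    pred = target + 0.1 * rng.normal(size=(4, 8, 8))
    mask = np.zeros((8, 8))
    mask[2:5, 2:5] = 1  # hypothetical hand bounding box
    print(regional_loss(pred, target, mask))
```

Restricting both terms to the mask means errors outside the face and hand regions do not contribute, which is the point of regional supervision: small regions are not drowned out by the much larger body and background area.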