RealisDance: Equip controllable character animation with realistic hands

Jingkai Zhou, Benzhi Wang, Weihua Chen, Jingqi Bai, Dongyang Li, Aixi Zhang, Hao Xu, Mingyang Yang, Fan Wang

TL;DR

RealisDance targets unstable generation and poor hand fidelity in controllable character animation by combining three pose types (DWPose, SMPL-CS, HaMeR) through a pose gating module. It adds temporal attention to the pose guidance network and applies pose shuffle augmentation during training, which smooths the video and improves robustness to corrupted pose inputs, while a reference UNet maintains character consistency. Qualitative comparisons and ablations show more realistic hands and more stable video than prior methods. The practical result is more reliable, realistic, and smooth pose-controlled character animation for image-to-video pipelines.

Abstract

Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via reference UNet, another crucial factor, pose control, has not been well studied by existing methods yet, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video will be shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose condition aspect. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.
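The abstract says RealisDance "adaptively leverages three types of poses" so that one corrupted pose sequence does not cause a failed generation. The sketch below illustrates that general idea with a softmax-weighted fusion of three feature vectors. It is a minimal illustration under stated assumptions, not the authors' pose gating module: the function names, the feature values, and the per-stream scores are all hypothetical, and in a trained model the scores would presumably be predicted by learned layers rather than set by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_and_fuse(features, scores):
    """Fuse per-stream feature vectors with softmax gate weights.

    features: dict mapping stream name -> list of floats (equal lengths)
    scores:   dict mapping stream name -> unnormalized gate score
    Both the interface and the gating rule are assumptions for illustration.
    """
    names = list(features)
    weights = softmax([scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, v in enumerate(features[name]):
            fused[i] += w * v
    return fused, dict(zip(names, weights))


# Hypothetical features for one spatial location; the HaMeR stream is
# deliberately filled with outlier values to mimic a corrupted estimate.
features = {
    "dwpose": [1.0, 0.0, 2.0],
    "smpl_cs": [0.8, 0.2, 1.8],
    "hamer": [50.0, -50.0, 50.0],
}
# Hypothetical gate scores: a low score suppresses the corrupted stream.
scores = {"dwpose": 2.0, "smpl_cs": 2.0, "hamer": -6.0}

fused, weights = gate_and_fuse(features, scores)
print(weights)
print(fused)
```

With these hand-picked scores the corrupted stream receives a near-zero weight, so the fused feature stays close to the average of the two intact streams.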

Paper Structure

This paper contains 6 sections and 8 figures.

Figures (8)

  • Figure 1: Samples generated from our reproduced Animate Anyone. Animate Anyone suffers from unstable generation if the condition pose is corrupted, as shown in the first two rows. Also, even if the condition pose is correct, Animate Anyone generates blurry and unrealistic hands, as shown in the last two rows.
  • Figure 2: Samples generated from RealisDance. As can be seen, the generated results achieve high-quality hands even for complex gestures.
  • Figure 3: Architecture of RealisDance. Thanks to multi-type poses, the pose gating module, the multi-layer pose guidance network, and the pose shuffle augmentation, RealisDance achieves robust generation, realistic hands, and smooth video.
  • Figure 4: Illustration of three types of poses. SMPL-CS integrates 3D, depth, and continuous semantic information, and HaMeR provides accurate 3D gesture estimation.
  • Figure 5: Architecture of pose gating module. In practice, we implement three individual condition encoders using one encoder with grouped convolution for faster speed.
  • ...and 3 more figures
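
The Figure 5 caption notes that three individual condition encoders are implemented as one encoder with grouped convolution. The sketch below shows why that substitution is possible, using a 1x1 convolution at a single pixel: a grouped convolution over the concatenated channels gives the same output as three separate convolutions, one per pose type. The channel counts, weights, and function names are hypothetical and chosen only to keep the example small; this is not the authors' encoder.

```python
def pointwise_conv(x, weight):
    """1x1 convolution at one pixel.

    x is a list of C_in channel values; weight is a C_out x C_in matrix
    given as a list of rows.
    """
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def grouped_pointwise_conv(x, weights):
    """Grouped 1x1 convolution at one pixel.

    Splits x into len(weights) equal channel groups, applies weights[g]
    to group g only, and concatenates the per-group outputs.
    """
    groups = len(weights)
    size = len(x) // groups
    out = []
    for g, weight in enumerate(weights):
        out.extend(pointwise_conv(x[g * size:(g + 1) * size], weight))
    return out


# Hypothetical per-pixel features for the three pose maps (2 channels each).
dwpose, smpl_cs, hamer = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]

# One hypothetical weight matrix per pose type (2 input -> 2 output channels).
w_dwpose = [[1.0, 0.0], [0.0, 1.0]]
w_smpl_cs = [[0.5, 0.5], [1.0, -1.0]]
w_hamer = [[2.0, 0.0], [0.0, 0.5]]

# Three independent encoders, outputs concatenated along the channel axis.
separate = (pointwise_conv(dwpose, w_dwpose)
            + pointwise_conv(smpl_cs, w_smpl_cs)
            + pointwise_conv(hamer, w_hamer))

# One grouped encoder applied to the concatenated input.
grouped = grouped_pointwise_conv(dwpose + smpl_cs + hamer,
                                 [w_dwpose, w_smpl_cs, w_hamer])

print(separate == grouped)  # True
```

Because no weights cross group boundaries, each pose type is still encoded independently, while the three encoders run as a single operation.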