RealisDance: Equip controllable character animation with realistic hands
Jingkai Zhou, Benzhi Wang, Weihua Chen, Jingqi Bai, Dongyang Li, Aixi Zhang, Hao Xu, Mingyang Yang, Fan Wang
TL;DR
RealisDance tackles the instability and poor hand fidelity of controllable character animation by introducing three pose streams (DWPose, SMPL-CS, HaMeR) and a pose gating mechanism that fuses them robustly. It adds a motion-enabled pose guidance network and pose shuffle augmentation to smooth the video and improve robustness against corrupted pose inputs, while relying on a reference UNet for character consistency. In qualitative comparisons and ablations, the approach produces more realistic hands and more stable video than prior methods. The practical impact is more reliable, realistic, and smooth pose-controlled character animation for image-to-video pipelines.
Abstract
Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via the reference UNet, another crucial factor, pose control, has not yet been well studied by existing methods, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video is shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all of the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose condition aspect. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.
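To make the idea of adaptively leveraging three pose types more concrete, the following is a minimal sketch of a gated fusion over per-stream feature vectors. It is an illustration only, not the RealisDance implementation: the softmax form of the gate, the function names (`softmax`, `gated_fuse`), the toy two-dimensional features, and the hand-set gate scores are all assumptions. In a trained model the scores would presumably be predicted from the pose features themselves.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fuse(streams, gate_scores):
    """Fuse per-stream feature vectors using softmax gate weights.

    streams:     dict mapping stream name -> feature vector (list of floats)
    gate_scores: dict mapping stream name -> unnormalized gate score
    Both arguments are hypothetical stand-ins for learned tensors.
    """
    names = list(streams)
    weights = softmax([gate_scores[n] for n in names])
    dim = len(streams[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, v in enumerate(streams[name]):
            fused[i] += w * v
    return fused, dict(zip(names, weights))


# Toy features for the three pose types named in the abstract.
streams = {
    "dwpose": [1.0, 0.0],
    "smpl_cs": [0.0, 1.0],
    "hamer": [1.0, 1.0],
}

# Assumed scores: a low score for DWPose mimics a gate that has
# learned to distrust a corrupted DWPose input.
gate_scores = {"dwpose": -4.0, "smpl_cs": 1.0, "hamer": 1.0}

fused, weights = gated_fuse(streams, gate_scores)
print(weights)
print(fused)
```

With these scores the DWPose stream contributes almost nothing to the fused feature, which is the behavior one would want from a gate when one pose source is unreliable while the other two remain usable.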
