DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model
Yucheng Xing, Jinxing Yin, Xiaodong Liu
TL;DR
DANCER tackles realistic single-person dance video synthesis by introducing two conditioning modules, AEM and PRM, within a latent-diffusion framework built on Stable Video Diffusion. AEM enriches reference-image appearance by fusing high- and low-level features, while PRM augments pose guidance with multi-domain cues derived from the source video using Sapiens, improving motion fidelity and temporal coherence. The authors also collect TikTok-3K, a large-scale dance video dataset, to bolster training and generalization. Empirical results on a TikTok-based test set show that DANCER surpasses state-of-the-art methods in both image- and video-level metrics, with ablation studies confirming the contributions of AEM, PRM, and additional data. The work provides a practical, scalable approach for high-quality dance video generation and offers a valuable dataset for future research in dance video synthesis.
Abstract
Recently, diffusion models have shown impressive ability in visual generation tasks. Beyond static images, increasing research attention has been drawn to the generation of realistic videos. Video generation not only places higher demands on quality, but also poses the challenge of ensuring video continuity. Among video generation tasks, human-involved content, such as human dancing, is even more difficult to generate due to the high degrees of freedom associated with human motion. In this paper, we propose a novel framework, named DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the recent Stable Video Diffusion model. As video generation is generally guided by a reference image and a video sequence, we introduce two modules into our framework to fully benefit from both inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from the Internet and build a novel dataset, TikTok-3K, to enhance model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where our model outperforms state-of-the-art methods. All data and code will be released upon acceptance.
