Everybody Dance Now
Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros
TL;DR
The paper tackles cross-subject motion transfer, using 2D pose as an intermediate representation to drive a target subject's appearance in video. It introduces a three-stage pipeline (pose detection, global pose normalization, and pose-to-video translation) trained with adversarial and perceptual losses, with a temporal coherence mechanism and a dedicated face refinement GAN. The authors also contribute a forensic detector that distinguishes synthesized from real content, and an open dataset of videos for training and motion transfer. Quantitative and qualitative evaluations, including ablations, show improved realism over baselines. Limitations include artifacts from clothing and hair and gaps in pose detection; future work aims to broaden pose coverage and improve normalization across camera setups.
Abstract
This paper presents a simple method for "do as I do" motion transfer: given a source video of a person dancing, we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We approach this problem as video-to-video translation using pose as an intermediate representation. To transfer the motion, we extract poses from the source subject and apply the learned pose-to-appearance mapping to generate the target subject. We predict two consecutive frames for temporally coherent video results and introduce a separate pipeline for realistic face synthesis. Although our method is quite simple, it produces surprisingly compelling results (see video). This motivates us to also provide a forensics tool for reliable synthetic content detection, which is able to distinguish videos synthesized by our system from real data. In addition, we release a first-of-its-kind open-source dataset of videos that can be legally used for training and motion transfer.
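The transfer loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `normalize_pose`, `transfer_motion`, and `toy_generator`, the `scale` and `dy` parameters, and the choice to condition each frame on the previously generated one are all assumptions made for illustration, and the "generator" is a toy function standing in for a trained pose-to-appearance network.

```python
from typing import Callable, List, Tuple

Pose = List[Tuple[float, float]]  # 2D keypoints as (x, y) pairs
Frame = float                     # toy stand-in for an image (assumption)


def normalize_pose(pose: Pose, scale: float, dy: float) -> Pose:
    """Hypothetical global normalization: uniform scale plus a vertical shift.

    The real normalization may differ; this only shows where the step sits.
    """
    return [(x * scale, y * scale + dy) for x, y in pose]


def transfer_motion(
    source_poses: List[Pose],
    generator: Callable[[Pose, Frame], Frame],
    scale: float,
    dy: float,
    first_prev: Frame,
) -> List[Frame]:
    """Generate one target frame per source pose.

    Each frame is conditioned on the normalized pose and on the previously
    generated frame (an assumed form of temporal conditioning).
    """
    outputs: List[Frame] = []
    prev = first_prev
    for pose in source_poses:
        normalized = normalize_pose(pose, scale, dy)
        frame = generator(normalized, prev)
        outputs.append(frame)
        prev = frame  # feed the new frame back in for the next step
    return outputs


def toy_generator(pose: Pose, prev: Frame) -> Frame:
    """Toy stand-in for a trained network: mixes pose and previous output."""
    return sum(y for _, y in pose) + 0.5 * prev


if __name__ == "__main__":
    poses = [[(0.0, 1.0), (0.0, 2.0)], [(0.0, 2.0), (0.0, 4.0)]]
    print(transfer_motion(poses, toy_generator, scale=2.0, dy=1.0, first_prev=0.0))
    # prints [8.0, 18.0]
```

In this sketch the second output depends on the first, which is the point of the feedback: changing an earlier frame changes the frames that follow it, so consecutive outputs cannot vary independently.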
