Everybody Dance Now

Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros

TL;DR

The paper tackles cross-subject motion transfer by using 2D pose as an intermediate representation to drive a target subject's appearance in video. It introduces a three-stage pipeline (pose detection, global pose normalization, and pose-to-video translation), adds a dedicated face refinement GAN, and trains with adversarial and perceptual losses. For temporal coherence, the model predicts two consecutive frames rather than treating each frame independently. The authors also provide a forensic detector that distinguishes videos synthesized by their system from real footage, and they release an open-source dataset of videos that can be legally used for training and motion transfer. Quantitative and qualitative evaluations, together with ablations, show improved realism over baselines. Limitations include artifacts from clothing and hair and gaps in pose detection; future work aims to broaden pose coverage and improve normalization across camera setups.

Abstract

This paper presents a simple method for "do as I do" motion transfer: given a source video of a person dancing, we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We approach this problem as video-to-video translation using pose as an intermediate representation. To transfer the motion, we extract poses from the source subject and apply the learned pose-to-appearance mapping to generate the target subject. We predict two consecutive frames for temporally coherent video results and introduce a separate pipeline for realistic face synthesis. Although our method is quite simple, it produces surprisingly compelling results (see video). This motivates us to also provide a forensics tool for reliable synthetic content detection, which is able to distinguish videos synthesized by our system from real data. In addition, we release a first-of-its-kind open-source dataset of videos that can be legally used for training and motion transfer.

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: "Do as I Do" motion transfer: given a YouTube clip of a ballerina (top), and a video of a graduate student performing various motions, our method transfers the ballerina's performance onto the student (bottom). Video: https://youtu.be/mSaIrz8lM1U
  • Figure 2: Our method creates correspondences by detecting poses in video frames (Video to Pose) and then learns to generate images of the target subject from the estimated pose (Pose to Video).
  • Figure 3: (Top) Training: Our model uses a pose detector $P$ to create pose stick figures from video frames of the target subject. We learn the mapping $G$ alongside an adversarial discriminator $D$ which attempts to distinguish between the "real" correspondences $(x_t, x_{t+1}), (y_t, y_{t+1})$ and the "fake" sequence $(x_t, x_{t+1}), (G(x_t), G(x_{t+1}))$. (Bottom) Transfer: We use a pose detector $P$ to obtain pose joints for the source person that are transformed by our normalization process $Norm$ into joints for the target person for which pose stick figures are created. Then we apply the trained mapping $G$.
  • Figure 4: Face GAN setup. Residual is predicted by generator $G_f$ and added to the original face prediction from the main generator.
  • Figure 5: Transfer results. In each section we show four consecutive frames. The top row shows the source subject and the bottom row shows the synthesized outputs of the target person.
  • ...and 5 more figures
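
The captions for Figures 3 and 4 describe three data flows: how the discriminator's "real" and "fake" inputs are assembled from two consecutive frames, how the transfer path chains $P$, $Norm$, and $G$, and how the face residual is added to the main generator's output. The sketch below only illustrates those flows with NumPy arrays. It is not the authors' implementation: every function name is hypothetical, `P`, `Norm`, `G`, and `G_f` are passed in as placeholder callables, `render` (the step that draws a stick figure from joints) and channel-wise concatenation are assumptions, and the inputs given to `G_f` are assumed because the caption does not state them.

```python
import numpy as np

def discriminator_inputs(x_t, x_t1, y_t, y_t1, G):
    """Assemble the 'real' and 'fake' inputs named in the Figure 3 caption.

    x_t, x_t1: pose stick figures for two consecutive frames, shape (H, W, C).
    y_t, y_t1: the matching ground-truth frames of the target subject.
    Stacking along the channel axis is an assumption made for this sketch.
    """
    real = np.concatenate([x_t, x_t1, y_t, y_t1], axis=-1)
    fake = np.concatenate([x_t, x_t1, G(x_t), G(x_t1)], axis=-1)
    return real, fake

def transfer_frame(source_frame, P, Norm, render, G):
    """Transfer path from Figure 3 (bottom): detect, normalize, draw, generate."""
    source_joints = P(source_frame)       # pose joints of the source person
    target_joints = Norm(source_joints)   # mapped into the target's frame
    stick_figure = render(target_joints)  # hypothetical drawing step
    return G(stick_figure)

def refine_face(face_pred, face_pose, G_f):
    """Figure 4: add the residual predicted by G_f to the original face prediction.

    Passing both the face pose crop and the face prediction to G_f is an
    assumption; the caption only says that G_f predicts the residual.
    """
    residual = G_f(face_pose, face_pred)
    return face_pred + residual
```

Keeping the networks as arguments makes the sketch independent of any deep learning framework, so the same flow can be checked with trivial stand-in functions before real models are plugged in.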