Do As I Do: Pose Guided Human Motion Copy

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren

TL;DR

The paper tackles pose-guided human motion copy with FakeVideo, a framework that combines a pose-to-appearance generator trained with Gromov-Wasserstein (GW) and perceptual losses, an episodic memory module, and sequence-to-sequence foreground generation with a spatial-temporal discriminator. It also refines faces in a self-supervised manner using a face orientation vector field, refines key body parts with multi-local GANs, and handles foreground and background separately for stable training. Evaluations on five datasets (iPER, ComplexMotion, SoloDance, Fish, Mouse) show state-of-the-art PSNR and FID and indicate that the method generalizes to non-human articulated objects. The approach improves realism and temporal coherence under limited training data, and is supported by ablations, a user study, and a discussion of inference time efficiency.

Abstract

Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, and these shortcomings are easily perceived by human observers. Motivated by this, we tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches, improving PSNR and FID by 7.2% and 12.4%, respectively.

Paper Structure

This paper contains 22 sections, 11 equations, 10 figures, 6 tables, 2 algorithms.

Figures (10)

  • Figure 1: In the training stage, we extract the poses from the given video frames of a target person and feed the poses into the model, which generates the video frames of the target person. In the inference stage, we feed the desired poses, which may be extracted from a video of a source person, to the trained model to generate the frames of the target person.
  • Figure 2: A high-level overview of our method. (a) For data pre-processing, we extract the pose sequence of the source video and separate the background from the target video. (b) We first generate the appearance sequence guided by the poses. Then the local body parts are enhanced with face enhancement and multi-local GANs. Finally, the refined appearance (foreground) and the background are coupled into a frame. (c) We train the appearance generator utilizing a Gromov-Wasserstein loss, a perceptual loss and an adversarial loss.
  • Figure 3: Pose-to-appearance generator and Gromov-Wasserstein loss. Top: We adopt an encoder-decoder architecture with dense skip connections that facilitate the fusion of features across scales. Bottom: The Gromov-Wasserstein loss is introduced to guide the pose-to-appearance generation.
  • Figure 4: Face orientation is extracted from the face vector field. Three different face orientations are presented in the figure. Specifically, we employ six vectors to characterize the face orientation: $v_1$: right eye $\rightarrow$ left eye, $v_2$: left eye $\rightarrow$ nose, $v_3$: right eye $\rightarrow$ nose, $v_4$: right ear $\rightarrow$ left ear, $v_5$: nose $\rightarrow$ right ear, $v_6$: nose $\rightarrow$ left ear.
  • Figure 5: Self-supervised face enhancement with vector field. From the target face pool, we select the $m$ faces whose face orientations are most similar to that of $f$. The selected similar faces are expected to facilitate detailed face texture generation.
  • ...and 5 more figures
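
The face orientation vectors of Figure 4 and the face selection of Figure 5 can be illustrated with a minimal sketch. It is not the authors' implementation. The keypoint names, the dictionary layout, the function names, and in particular the similarity measure (mean cosine similarity between corresponding vectors) are assumptions made for illustration; the captions above only state which six vectors are used and that the $m$ most similar faces are selected.

```python
import numpy as np

# (source, destination) keypoint pairs for v1..v6, following the Figure 4 caption.
# The keypoint names themselves are hypothetical labels for a 2D pose estimator's output.
VECTOR_PAIRS = [
    ("right_eye", "left_eye"),  # v1
    ("left_eye", "nose"),       # v2
    ("right_eye", "nose"),      # v3
    ("right_ear", "left_ear"),  # v4
    ("nose", "right_ear"),      # v5
    ("nose", "left_ear"),       # v6
]


def orientation_vectors(keypoints):
    """Return a (6, 2) array holding v1..v6 for one face.

    `keypoints` maps a keypoint name to its (x, y) image coordinates.
    """
    return np.array([
        np.asarray(keypoints[dst], dtype=float) - np.asarray(keypoints[src], dtype=float)
        for src, dst in VECTOR_PAIRS
    ])


def orientation_similarity(a, b, eps=1e-8):
    """Mean cosine similarity between corresponding vectors (an assumed metric)."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(dots / (norms + eps)))


def select_similar_faces(query, pool, m):
    """Return indices of the m pool faces whose orientation is closest to the query."""
    q = orientation_vectors(query)
    scores = [orientation_similarity(q, orientation_vectors(p)) for p in pool]
    order = np.argsort(scores)[::-1]  # highest similarity first
    return order[:m].tolist()


# Two toy faces: one roughly frontal, one turned to the side.
frontal = {"right_eye": (30, 40), "left_eye": (70, 40), "nose": (50, 60),
           "right_ear": (10, 45), "left_ear": (90, 45)}
turned = {"right_eye": (35, 40), "left_eye": (60, 40), "nose": (30, 60),
          "right_ear": (40, 45), "left_ear": (85, 45)}

best = select_similar_faces(frontal, [turned, frontal], m=1)
```

Because the vectors are differences of keypoints, they are invariant to where the face sits in the frame, and cosine similarity additionally ignores face scale. Whether the original method normalizes in this way is not stated in the captions.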