Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation

Hong Li, Boyu Liu, Xuhui Liu, Baochang Zhang

TL;DR

This work introduces self-supervised representation learning to encode and decode expressions in the latent feature space and decouple them from other motion information, and proposes a new way to compute keypoints aiming to achieve arbitrary motion control.

Abstract

Face animation deals with controlling and generating facial features, with a wide range of applications. Methods based on unsupervised keypoint positioning can produce realistic and detailed virtual portraits. However, they cannot achieve controllable face generation because existing keypoint decomposition pipelines fail to fully decouple identity semantics from intertwined motion information (e.g., rotation, translation, and expression). To address these issues, we present a new method, Motion Manipulation via unsupervised keypoint positioning in Face Animation (MMFA). First, we introduce self-supervised representation learning to encode and decode expressions in the latent feature space and decouple them from other motion information. Second, we propose a new way to compute keypoints, aiming to achieve arbitrary motion control. Moreover, we design a variational autoencoder to map expression features to a continuous Gaussian distribution, allowing us, for the first time, to interpolate facial expressions in an unsupervised framework. Extensive experiments on publicly available datasets validate the effectiveness of MMFA and show that it offers pronounced advantages over prior art in creating realistic animation and manipulating face motion.
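
To make the interpolation idea concrete, the following is a minimal sketch of how expression codes could be blended once they live in a continuous Gaussian latent space. It is not the authors' implementation: the function names, the 4-dimensional latent size, and the example vectors are all assumptions made for illustration, and only the standard VAE reparameterization and KL term are used.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (standard VAE trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def interpolate(z_a, z_b, alpha):
    """Linear blend of two latent codes: alpha=0 gives z_a, alpha=1 gives z_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


# Hypothetical 4-D expression latents (illustrative values, not from the paper).
z_neutral = [0.0, 0.0, 0.0, 0.0]
z_smile = [1.0, -2.0, 0.5, 4.0]

# Five evenly spaced codes from neutral to smile; in a full system each code
# would be passed to an expression decoder (not shown here).
path = [interpolate(z_neutral, z_smile, k / 4) for k in range(5)]
```

Because the KL term pulls the encoder's outputs toward a shared standard normal, intermediate codes along such a path are more likely to decode to plausible expressions than they would be in an unregularized feature space.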

Paper Structure

This paper contains 28 sections, 15 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: An example of MMFA. MMFA can realize realistic motion attribute editing while achieving face animation.
  • Figure 2: Overview of MMFA, where $\copyright$ indicates channel-wise concatenation. Part (a) details the steps by which our model decomposes keypoints. Part (b) describes the expression encoder-decoder structure and self-supervised representation learning. Part (c) shows the structure of the multi-scale generator.
  • Figure 3: Visual comparison of attribute editing between MMFA and Face vid2vid. The second column displays the semantic images corresponding to the neutral keypoints $N$, defined as keypoints with no rotation, translation, or expression correction. Note that in MMFA the neutral keypoints have been scaled according to the canonical keypoints, whereas in Face vid2vid the canonical keypoints are also the neutral keypoints. Subsequent columns show the semantic images obtained after applying translations $t$, rotations $R$, and expression deformations $\delta$.
  • Figure 4: Illustration of the latent VAE training.
  • Figure 5: Visual comparisons with state-of-the-art methods.
  • ...and 10 more figures
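
The attribute edits described in the Figure 3 caption can be sketched as follows. This is only an illustration under a stated assumption: it uses the Face vid2vid-style composition $x_k = R\,N_k + t + \delta_k$, and MMFA's own keypoint computation (including how the neutral keypoints are scaled) is not given in this section and may differ. The helper names `yaw_matrix` and `drive_keypoints` are hypothetical.

```python
import math


def yaw_matrix(angle_rad):
    """3x3 rotation about the vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def drive_keypoints(neutral, R, t, delta):
    """Apply x_k = R @ N_k + t + delta_k to each 3D keypoint (assumed form)."""
    driven = []
    for point, d in zip(neutral, delta):
        rotated = [sum(R[i][j] * point[j] for j in range(3)) for i in range(3)]
        driven.append([rotated[i] + t[i] + d[i] for i in range(3)])
    return driven


# Two illustrative neutral keypoints N (values are made up).
N = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
no_expression = [[0.0, 0.0, 0.0] for _ in N]

# Edit one attribute at a time, as in the columns of Figure 3.
translated = drive_keypoints(N, yaw_matrix(0.0), [0.1, 0.0, 0.0], no_expression)
rotated = drive_keypoints(N, yaw_matrix(math.pi / 2), [0.0, 0.0, 0.0], no_expression)
```

Keeping $R$, $t$, and $\delta$ as separate inputs is what allows each motion attribute to be edited independently, provided the model has actually decoupled them.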