LatentKeypointGAN: Controlling Images via Latent Keypoints

Xingzhe He, Bastian Wandt, Helge Rhodin

TL;DR

LatentKeypointGAN is a two-stage GAN internally conditioned on a set of keypoints and their associated appearance embeddings. These give control over the position and style of the generated objects and their parts, and enable a new, GAN-based method for unsupervised keypoint detection.

Abstract

Generative adversarial networks (GANs) have attained photo-realistic quality in image generation. However, how best to control the image content remains an open challenge. We introduce LatentKeypointGAN, a two-stage GAN that is trained end-to-end on the classical GAN objective with internal conditioning on a set of spatial keypoints. The keypoints and their associated appearance embeddings respectively control the position and style of the generated objects and their parts. A major difficulty, which we address with suitable network architectures and training schemes, is disentangling the image into spatial and appearance factors without domain knowledge or supervision signals. We demonstrate that LatentKeypointGAN provides an interpretable latent space that can be used to re-arrange the generated images by re-positioning and exchanging keypoint embeddings, for example generating portraits by combining the eyes, nose, and mouth from different images. In addition, the explicit generation of keypoints and matching images enables a new, GAN-based method for unsupervised keypoint detection.

Paper Structure

This paper contains 64 sections, 15 equations, 25 figures, 12 tables.

Figures (25)

  • Figure 1: GANs can generate photo-realistic images (a) but lack local editing capability. LatentKeypointGAN generates images with associated keypoints (a-b), which enables local editing by moving keypoints (c), exchanging appearance (d), removing individual parts (e), and adding one or more parts (f). Our improvements are on the unsupervised learning of an interpretable latent space that disentangles pose and appearance, which makes it easy to use and applicable to diverse domains, including portraits (top row), indoor rooms (bottom row), and persons (see results section).
  • Figure 2: Overview. Starting from noise $\mathbf{z}$, LatentKeypointGAN generates keypoint coordinates $\mathbf{k}$ and their embeddings $\mathbf{w}$. Crucial is how they are turned into feature maps that are localized around the keypoints, forming conditional maps for the image generation via SPADE blocks at different resolutions. At inference time, the position and embedding of keypoints can be edited by the user to control the position and appearance of parts.
  • Figure 3: Location and scale editing. The first column is the source and the last the target. The images in-between are the result of the following operations. First row: pushing the eye keypoint distance from 0.8x to 1.2x. Note that the marked eye keypoints in this row are slightly shifted upward for better visualization. Second row: interpolating the hair keypoint to move the fringe from right to left. Third row: scaling the keypoint location and, therefore, the face from 1.15x to 0.85x. Fourth row: interpolating all keypoint locations, to rotate the head to the target orientation.
  • Figure 4: Our Correlation-based part Disentanglement (CPD) metric is computed as the sum over part correlations. From a pair of images (left), the appearance of parts is exchanged one by one, here for eyes, nose, and mouth (center). The pairwise correlation of the resulting difference maps (right) forms the entries of the correlation matrix.
  • Figure 5: Supported editing features. While all of the tested methods enable global editing, only some offer local editing of part appearance, and none of the GAN-based methods (top three) demonstrated adding, removing, or animating parts via control handles, as we provide.
  • ...and 20 more figures
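
The captions of Figures 2 and 3 describe keypoints being turned into maps localized around each keypoint, and editing by scaling keypoint locations. A minimal NumPy sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian form of the maps, the value of sigma, the [-1, 1] coordinate range, scaling about the keypoint centroid, and the function names keypoint_heatmaps and scale_keypoints are all hypothetical choices that the captions do not specify.

```python
import numpy as np


def keypoint_heatmaps(keypoints, size, sigma=0.1):
    """Render one map per keypoint, peaked at the keypoint location.

    keypoints: (K, 2) array of (x, y) coordinates, assumed to lie in [-1, 1].
    Returns an array of shape (K, size, size) with values in (0, 1].
    The Gaussian shape and sigma are assumptions made for illustration.
    """
    coords = np.linspace(-1.0, 1.0, size)
    xs, ys = np.meshgrid(coords, coords)  # xs varies along columns, ys along rows
    grid = np.stack([xs, ys], axis=-1)  # (size, size, 2)
    diff = grid[None] - keypoints[:, None, None, :]  # (K, size, size, 2)
    sq_dist = np.sum(diff ** 2, axis=-1)  # squared distance to each keypoint
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def scale_keypoints(keypoints, factor):
    """Scale keypoint locations about their centroid (assumed editing rule)."""
    center = keypoints.mean(axis=0, keepdims=True)
    return center + factor * (keypoints - center)


# Three hypothetical keypoints, e.g. two eyes and a mouth.
keypoints = np.array([[-0.5, -0.25], [0.5, -0.25], [0.0, 0.5]])
maps = keypoint_heatmaps(keypoints, size=65)

# Shrink the layout to 0.85x, as in the third row of Figure 3.
shrunk = scale_keypoints(keypoints, 0.85)
shrunk_maps = keypoint_heatmaps(shrunk, size=65)
```

In this sketch, moving or scaling the keypoints changes only where the maps peak. Any per-keypoint appearance embedding would be left untouched, which mirrors the separation of position and style that the captions describe.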